.NET on Cloud Foundry

As a consultant on the Cloud Foundry platform I regularly get asked if CF can host .NET applications. The answer is yes. However, how much we as platform engineers have to do to make it possible depends on the application. In the best case you don’t have to do anything special, but as I’ll explain below, the chance of that is quite low.

Note that I wrote a post on the same topic some 2 years ago. Now that Diego, .NET Core, and Concourse have all gained production status it’s time to see how the dust has settled.

The old and the new

The .NET Framework we have become used to over the last 16 years or so is at version 4.6.x. It is essentially single platform (Windows), closed source, installed and upgraded as part of the OS, has a large footprint and is not especially fast.
Microsoft realized at some point this just wouldn’t do anymore in the modern cloud era in which frameworks are developed as open source, without explicit OS dependencies, and applications are typically deployed as a (containerized) set of lightweight services that are packaged together with their versioned dependencies (libraries and application runtime). Some time after, the world saw the first alphas and betas of .NET Core, and on June 27th 2016 it reached GA with version 1.0.0. This made lots of people very happy and was generally seen as a good move (albeit quite late).
While Microsoft is still actively developing both the legacy .NET Framework as well as .NET Core, it was made pretty clear .NET Core is the future.

So what about apps?

ASP.NET Core in a Nutshell


An application written as an (ASP).NET Core app will run on the old and the new – although sometimes it needs some community convincing to keep it that way. The opposite is not the case: many Windows specific/Win32 APIs are for obvious reasons not available on the cross-platform .NET Core runtime, so legacy .NET apps taking a dependency on these APIs can not be run on .NET Core without refactoring.
Note this dependency doesn’t have to be explicit: it’s about the whole dependency chain. For instance, the popular Entity Framework ORM library takes a dependency on ADO.NET which is highly dependent on Win32, and so can not be used. Instead, applications using it should be rewritten to use the new EF Core library.

New is easy – .NET Core

From a platform engineering perspective supporting .NET Core is easy. As .NET Core can run in a container on Linux, it follows the default hosting model of CF. So you just install/accept the dotnetcore buildpack and off you go.

Old is hard – .NET Framework

Of course you can attempt to convince your developers they have to port their code to .NET Core. However, your mileage may vary. Since legacy is what makes you money today, a large existing .NET codebase that’s the result of years of engineering can’t be expected to be rewritten overnight. And if it is rarely updated, it will be very hard to make a business case for a rewrite even if you do have the resources.
Instead, a more realistic scenario is a minimal refactoring in such a way that the vast majority of the never touched cold code can stay on .NET Framework, while all the new code together with often changed hot code can be written in .NET Core.

It needs Windows – Garden has your back

Cloud Foundry before 2015 used Warden containers, which took a hard dependency on Linux. The rewrite of the DEA component of Cloud Foundry in Go, resulting in Diego (‘DEA-Go’), was covered quite a lot online. For .NET support, the accompanying rewrite of Warden in Go – resulting in Garden *badum tish* – is much more interesting, since Garden is a platform independent API for containerization. So what we need for a Windows Diego Cell is:

  • a Windows Garden backend – to make CF provision workloads on the VM
  • the BOSH agent for Windows – to manage the VM in the same way all of Cloud Foundry is managed on an infrastructure level

We need to package all this in a template VM stemcell so BOSH can use it. You can find the recipe for doing this, and some automation scripts here. Even with the scripts, it’s a lot of cumbersome, time consuming, and error prone work, so you best automate it. I’ll discuss in my next post how I did that using a pipeline in Concourse CI.

If you are on a large public cloud like Azure, GCP or AWS, and use Pivotal Cloud Foundry, Pivotal has supported stemcells ready for download. If you are on a private cloud, or not using PCF, you have to roll your own. I’m not sure why Pivotal doesn’t offer vSphere or OpenStack Windows stemcells, but I can imagine it has something to do with legal matters (think Microsoft, licensing and redistribution).

Final steps

PCF Runtime for Windows


Once you have the stemcell you need to do a few things:

Again, if you are using PCF, Pivotal has you covered, and you can download and install the PCF Windows Runtime tile which takes care of both of the above. If you are on vanilla CF, you have to do some CLI magic yourselves.

.NET on Pivotal Cloud Foundry

In my latest post, I tested Lattice.cf, the single VM smaller brother of Pivotal Cloud Foundry (PCF). Considering a full installation of PCF has a footprint of about 25 Virtual Machines (VM) requiring a total of over 33 GB RAM, 500+ GB storage, and some serious CPU power, it’s not hard to see why Lattice is more developer friendly. However, that wasn’t my real motivation to try it out: more important was Lattice’s incorporation of the new elastic runtime architecture, codenamed ‘Diego’, which will replace the current Droplet Execution Agent (DEA) based runtime in due time.

For me, the main reasons to get excited about Diego are two-fold:

  • Diego can run Docker containers: I demoed this in my latest post
  • Diego can run Windows (.NET) based workloads

In this post, I’ll demo the Windows based workloads by deploying an ASP.NET MVC application, which uses the full-fledged production ready .NET stack and requires a Windows kernel – as opposed to .NET Core which is cross-platform, runs on a plentitude of OSes, and is very exciting, but not production ready yet.

Diego on PCF

At this point we have to resort to PCF as Lattice can’t run Windows workloads. This is because Lattice’s strong point (all services in 1 VM) is also its weakness: since Lattice incorporates all required services in a single Linux VM it quite obviously loses the ability to schedule Windows workloads.

Let’s take a quick look at the Diego architecture overview:

Diego architecture overview

Diego consists of a ‘brain’ and a number of ‘cells’. These cells run a container backend implementing Garden – a platform-neutral API for containerization (and the Go version of Warden). The default backend – garden-linux – is Linux based and compatible with various container technologies found on Linux (including Docker).

As soon as we run the various services in the overview on separate VMs (as we do on PCF), it becomes possible to provision a Windows cell. ‘All’ we need is a Windows backend for Garden and the various supporting services, and we should be good to go. Preferably in the form of a convenient installer.

One problem remains: we still need the Diego runtime in Pivotal Cloud Foundry. Kaushik Bhattacharya at Pivotal supplied me with an internal beta of ‘Diego for PCF’ (version 0.2) which I could install in my Pivotal on VMware vSphere lab environment. Installed, the Ops Manager tile looks like this:

Diego for PCF tile

And the default VMs and resources that come with Diego for PCF (notice the single default cell which is a garden-linux cell):

Diego for PCF default resources

garden-windows cell

To provision a Windows cell, we have to create a new VM manually and install a Windows server OS, and run a setup powershell script as well as the installer with the correct configuration. I’ll skip the details here, but when all is done, you can hit the receptor service url (in my case https://receptor.system.cf.lab.local/v1/cells), and it should show 2 cells: 1 installed as part of Diego for PCF, as well as the one we just created:

[
    {
        "cell_id": "DiegoWinCell",
        "zone": "d48224618511a8ac44df",
        "capacity": {
            "memory_mb": 16383,
            "disk_mb": 40607,
            "containers": 256
        }
    },
    {
        "cell_id": "cell-partition-d48224618511a8ac44df-0",
        "zone": "d48224618511a8ac44df",
        "capacity": {
            "memory_mb": 16049,
            "disk_mb": 16325,
            "containers": 256
        }
    }
]

Intermezzo: Containerization on Windows?

There must be some readers who are intrigued by the garden-windows implementation by now. After all, since when does Windows have support for containers? In fact, Microsoft has announced container support in the next Server OS, and the Windows Server 2016 Technical Preview 3 with container support was just released. However, this is not the ‘containerization’ used in the current version of garden-windows.

So how does it work?

Some of you may know Hostable Web Core: an API by which you can load a copy of IIS inside your process. So what happens when you push an app to Cloud Foundry and select the windows stack, is that app is hosted inside a new process on the cell, in which it gets its own copy of IIS after which it’s started.

I know what you’re thinking by now: “but that’s not containerization at all”. Indeed, strictly speaking it isn’t: it doesn’t use things like cgroups and namespaces used by Linux (and future Windows) container technologies in order to guarantee container isolation. However, from the perspective of containers as ‘shipping vehicles for code’ it’s very much containerization, as long as you understand the security implications.

Deploying to the Windows cell

Deployment to the Windows cell isn’t harder than to default cells, however, there are a couple of things to keep in mind:

  • as the windows stack isn’t the default, you have to specify it explicitly
  • as for now the DEA mode of running tasks is still default, you have to enable Diego support explicitly

Using the Diego Beta CLI the commands to push a full .NET stack demo MVC app are as follows (assuming you cloned it from github):

cf push diegoMVC -m 2g -s windows2012R2 -b https://github.com/ryandotsmith/null-buildpack.git --no-start -p ./
cf enable-diego diegoMVC
cf start diegoMVC

After pushing and scaling, the Pivotal Apps Manager shows:

Pivotal CF Apps Manager with DiegoMVC .NET  app

And the app itself:

DiegoMVC .NET application on Windows cell on Pivotal CF

Summary

Diego for PCF is still in internal beta, but soon Pivotal Cloud Foundry will have support for running applications using the full .NET stack.

Useful null reference exceptions

As a .NET developer you’re guaranteed to have at some point run into the “Object reference not set to an instance of an object” message, or NullReferenceException. Encountering one means, without exception, that there is a logic problem in your code, but finding what the supposed “instance of an object” or target was can be quite hard. After all, the target was not assigned to the reference, so how do we know what should have been there? In fact we can’t, as Eric Lippert explains.

However, just because we don’t know the target doesn’t mean we don’t know anything: in most cases there is some information we can obtain like:

  • type info: the type of the reference, and so the type of the target or at least what it inherits from (baseclass) or what it implements (interface)
  • operation info: the type of operation that caused the reference to be dereferenced

For example, in case the operation is a method call (callvirt), we know the C# compiler only compiles if it knows the methods are supported by the supposed target (through inheritance or directly), so with the info about the intended method we would have a good hint about what object is null.

According to MSFT, some other instructions that can throw NullReferenceException are: “The following Microsoft intermediate language (MSIL) instructions throw NullReferenceException: callvirt, cpblk, cpobj, initblk, ldelem.<type>, ldelema, ldfld, ldflda, ldind.<type>, ldlen, stelem.<type>, stfld, stind.<type>, throw, and unbox“. The amount of useful information will depend on the IL instruction context. However, currently none of this information is in the NullReferenceException: it just states some object reference was null, it was dereferenced, and in what method, nothing more, deal with it.

If you’re running debug code under Visual Studio this isn’t so much of a problem, as the debugger will break as soon as the dereferencing occurs (‘first chance’) and show you the source, but what about production code without source? Sure you can catch the exception and log it, but between the time it is thrown and the moment a catch handler is found, the runtime (CLR) is in control. At the time we are back in user code inside our catch handler, all we have to go on is the information in the NullReferenceException itself, which is essentially nothing. Debugging pros can attach WinDbg+SOS to a crashdump, but when the crash is the result of a ‘second chance exception’ (no exception handler was found) we’re at an even later stage. What we really want is a tool which can attach to the production process like a debugger, and get more info about first chance exceptions as they are thrown, as that’s the moment when all the info is available.

To give you an idea of the available info, the code below periodically throws a series of NullReferenceExceptions with a very different origin:

using System;
using System.Threading;

namespace TestNullReference
{
    interface TestInterface
    {
        void TestCall();
    }

    class TestClass
    {
        public int TestField;

        public void TestCall() { }
    }

    class TestClass2 : TestClass
    {   
    }

    class Program
    {
        #region methods to invoke null reference exceptions for various IL opcodes

        /// <summary>
        /// IL throw
        /// </summary>
        static void Throw()
        {
            throw null;
        }

        /// <summary>
        /// IL callvirt on interface
        /// </summary>
        static void CallVirtIf()
        {
            TestInterface i = null;
            i.TestCall();
        }

        /// <summary>
        /// IL callvirt on class
        /// </summary>
        static void CallVirtClass()
        {
            TestClass i = null;
            i.TestCall();
        }

        /// <summary>
        /// IL callvirt on inherited class
        /// </summary>
        static void CallVirtBaseClass()
        {
            TestClass2 i = null;
            i.TestCall();
        }

        /// <summary>
        /// IL ldelem
        /// </summary>
        /// <param name="a"></param>
        static void LdElem()
        {
            int[] array = null;
            var firstElement = array[0];
        }

        /// <summary>
        /// IL ldelema
        /// </summary>
        /// <param name="a"></param>
        static unsafe void LdElemA()
        {
            int[] array = null;
            fixed (int* firstElementA = &(array[0]))
            {                
            }
        }

        /// <summary>
        /// IL stelem
        /// </summary>
        /// <param name="a"></param>
        static void StElem()
        {
            int[] array = null;
            array[0] = 3;
        }

        /// <summary>
        /// IL ldlen
        /// </summary>
        /// <param name="a"></param>
        static void LdLen()
        {
            int[] array = null;
            var len = array.Length;
        }

        /// <summary>
        /// IL ldfld
        /// </summary>
        /// <param name="a"></param>
        static void LdFld()
        {
            TestClass c = null;
            var fld = c.TestField;
        }

        /// <summary>
        /// IL ldflda
        /// </summary>
        /// <param name="a"></param>
        static unsafe void LdFldA()
        {
            TestClass c = null;
            fixed (int* fld = &(c.TestField))
            {
                
            }
        }

        /// <summary>
        /// IL stfld
        /// </summary>
        /// <param name="a"></param>
        static void StFld()
        {
            TestClass c = null;
            c.TestField = 3;
        }

        /// <summary>
        /// IL unbox_any
        /// </summary>
        static void Unbox()
        {
            object o = null;
            var val = (int) o;
        }

        /// <summary>
        /// IL ldind
        /// </summary>
        static unsafe void LdInd()
        {
            int* valA = null;
           
            var val = *valA;
        }

        /// <summary>
        /// IL stind
        /// </summary>
        static unsafe void StInd()
        {
            int* valA = null;

            *valA = 3;
        }
        #endregion

        static void LogNullReference(Action a)
        {
            try
            {
                a();
            }
            catch (NullReferenceException ex)
            {
                var msg = string.Format("NullReferenceException executing {0} : {1}", a.Method.Name, ex.Message);
                Console.WriteLine(msg);
            }
        }

        static void Main(string[] args)
        {
            while (!Console.KeyAvailable)
            {
                LogNullReference(Throw);

                LogNullReference(CallVirtIf);
                LogNullReference(CallVirtClass);
                LogNullReference(CallVirtBaseClass);

                LogNullReference(LdElem);
                LogNullReference(LdElemA);
                LogNullReference(StElem);
                LogNullReference(LdLen);

                LogNullReference(LdFld);
                LogNullReference(LdFldA);
                LogNullReference(StFld);

                LogNullReference(Unbox);

                LogNullReference(LdInd);
                LogNullReference(StInd);

                Thread.Sleep(2000);   
            }           
        }
    }
}

All 14 of them will give us the dreaded “Object reference not set to an instance of an object” message.

Now what happens if we attach a tracing tool that gets as much info as possible:

Attempted to throw an uninitialized exception object. In static void TestNullReference.Program::Throw() cil managed  IL 1/1 (reported/actual).
Attempted to call void TestNullReference.TestInterface::TestCall() cil managed  on an uninitialized type. In static void TestNullReference.Program::CallVirtIf() cil managed  IL 3/3 (reported/actual).
Attempted to call void TestNullReference.TestClass::TestCall() cil managed  on an uninitialized type. In static void TestNullReference.Program::CallVirtClass() cil managed  IL 3/3 (reported/actual).
Attempted to call void TestNullReference.TestClass::TestCall() cil managed  on an uninitialized type. In static void TestNullReference.Program::CallVirtBaseClass() cil managed  IL 3/3 (reported/actual).
Attempted to load elements of type System.Int32 from an uninitialized array. In static void TestNullReference.Program::LdElem() cil managed  IL 3/4 (reported/actual).
Attempted to load elements of type System.Int32 from an uninitialized array. In static void TestNullReference.Program::LdElemA() cil managed  IL 3/4 (reported/actual).
Attempted to store elements of type System.Int32 in an uninitialized array. In static void TestNullReference.Program::StElem() cil managed  IL 3/5 (reported/actual).
Attempted to get the length of an uninitialized array. In static void TestNullReference.Program::LdLen() cil managed  IL 3/3 (reported/actual).
Attempted to load non-static field int TestNullReference.TestClass::TestField from an uninitialized type. In static void TestNullReference.Program::LdFld() cil managed  IL 3/3 (reported/actual).
Attempted to load non-static field int TestNullReference.TestClass::TestField from an uninitialized type. In static void TestNullReference.Program::LdFldA() cil managed  IL 3/3 (reported/actual).
Attempted to store non-static field int TestNullReference.TestClass::TestField in an uninitialized type. In static void TestNullReference.Program::StFld() cil managed  IL 3/4 (reported/actual).
Attempted to cast/unbox a value/reference type of type System.Int32 using an uninitialized address. In static void TestNullReference.Program::Unbox() cil managed  IL 3/3 (reported/actual).
Attempted to load elements of type System.Int32 indirectly from an illegal address. In static void TestNullReference.Program::LdInd() cil managed  IL 4/4 (reported/actual).
Attempted to store elements of type System.Int32 indirectly to a misaligned or illegal address. In static void TestNullReference.Program::StInd() cil managed  IL 4/5 (reported/actual).

You can download and play with the tool already. Below I’ll shed some light on how this info can be obtained.

What the tracer does

One blog post is not enough to fully explain how to write a managed debugger. However, enough has been written about how to leverage the managed debugging API so for this post I’m going to assume we’ve attached a managed debugger to the target process, implemented debugger callback handlers, hooked them up and are handling exception callbacks.

The exception callback has the following signature:

HRESULT Exception (
    [in] ICorDebugAppDomain   *pAppDomain,
    [in] ICorDebugThread      *pThread,
    [in] ICorDebugFrame       *pFrame,
    [in] ULONG32              nOffset,
    [in] CorDebugExceptionCallbackType dwEventType,
    [in] DWORD                dwFlags
);

The actual exception can be obtained from the thread as an ICorDebugReferenceValue which can be dereferenced to an ICorDebugObjectValue of which you can ultimately get the ICorDebugClass and metadata token (mdTypeDef). To find out if this exception is a NullReferenceException, you can either look up this token using the metadata APIs, or compare it to a prefetched metadata token.

When we know we’re dealing with a 1st chance null reference exception, we can dig deeper and try to find out the offending IL instruction. From nOffset, we already have the IL offset in the method frame’s code. The code itself can be obtained by querying the ICorDebugFrame for an ICorDebugILFrame interface, and requesting it for its code (ICorDebugCode2), which has a method for retrieving the actual IL bytes.

Depending on the IL instruction we find at nOffset in the IL bytes, we can get various details and log them.

For the instructions that can throw:

  • callvirt: a call to a known instance method (mdMethodDef) on an uninitialized type
  • cpblk, cpobj, initblk: shouldn’t happen (not exposed by C#)
  • ldelem.<type>, ldelema, stelem.<type>: an attempt to load/store elements of a known type (mdTypeDef) from/to an uninitialized array
  • ldfld, ldflda, stfld: an attempt to load/store a known non-static field (mdFieldDef) of a known uninitialized type
  • ldind.<type>, stind.<type>: an invalid address was passed to the load instruction, or a misaligned address was passed to the store instruction (shouldn’t happen as this would be a compiler instead of user code bug)
  • ldlen: an attempt to get the length of an uninitialized array
  • throw: an attempt to throw an uninitialized exception object
  • unbox, unbox_any: an attempt to cast/unbox a value/reference type of a known type (mdTypeDef) using an uninitialized address

The various metadata tokens can be looked up using the metadata APIs mentioned before, and finally formatted into a nice message.
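To make the lookup step a bit more concrete, here is a minimal sketch of how the opcode at a given offset in the IL bytes could be classified along the lines of the list above. It is not the tracer's actual code: the class and method names (NullRefOpcodeDescriber, Describe) and the message texts are made up for illustration, and it only decodes the opcode itself using the values from the ECMA-335 instruction set. Resolving the metadata token that follows the opcode is left out.

```csharp
using System;

// Hypothetical helper: classifies the IL opcode found at a given offset.
public static class NullRefOpcodeDescriber
{
    public static string Describe(byte[] il, int offset)
    {
        if (il == null) throw new ArgumentNullException("il");
        if (offset < 0 || offset >= il.Length) throw new ArgumentOutOfRangeException("offset");

        // Most opcodes are one byte; 0xFE is the prefix of the two-byte opcodes.
        int op = il[offset];
        if (op == 0xFE)
        {
            if (offset + 1 >= il.Length) return "truncated two-byte opcode";
            op = 0xFE00 | il[offset + 1];
        }

        if (op == 0x6F) return "callvirt: instance method call on a null reference";
        if (op == 0x7A) return "throw: attempt to throw a null exception object";
        if (op == 0x8E) return "ldlen: length requested of a null array";
        if (op == 0x7B || op == 0x7C) return "ldfld/ldflda: field load through a null reference";
        if (op == 0x7D) return "stfld: field store through a null reference";
        if (op == 0x79 || op == 0xA5) return "unbox/unbox.any: unbox or cast of a null reference";
        // ldelema (0x8F), ldelem.<type> (0x90-0x9A) and generic ldelem (0xA3)
        if ((op >= 0x8F && op <= 0x9A) || op == 0xA3) return "ldelem/ldelema: element load from a null array";
        // stelem.<type> (0x9B-0xA2) and generic stelem (0xA4)
        if ((op >= 0x9B && op <= 0xA2) || op == 0xA4) return "stelem: element store into a null array";
        if (op >= 0x46 && op <= 0x50) return "ldind: indirect load from an invalid address";
        // stind.<type> (0x51-0x57) and stind.i (0xDF)
        if ((op >= 0x51 && op <= 0x57) || op == 0xDF) return "stind: indirect store to an invalid address";
        if (op == 0x70 || op == 0xFE17 || op == 0xFE18) return "cpobj/cpblk/initblk: block operation on an invalid address";

        return "opcode 0x" + op.ToString("X") + " is not a known NullReferenceException source";
    }
}
```

Note that the sketch assumes the offset points exactly at the start of the offending instruction. As the ‘reported/actual’ columns in the trace output above suggest, the reported offset does not always do so, and a real implementation has to correct for that.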

Creating an automatic self-updating process

Recently I was asked by a client to replace a single monolithic custom workflow engine with a more scalable and loosely coupled modern alternative. We decided on a centralized queue which contained the work items and persisted them, with a manager (scheduler) on top which would accept connections of a dynamically scalable number of processors which would request and then do the actual work. It’s an interesting setup in itself which relies heavily on dependency injection, Command-Query Separation, Entity Framework code first with Migrations for the database, and code first WCF for a strongly typed communication between the scheduler and its processors.

Since there would be many Processors without an administration of where they would be installed, one of the wishes was to make them self-update at runtime when new versions of the code would be available.

Detecting a new version

A key component of the design is for the processors to register themselves on the scheduler when they start. In the same spirit, they could call to an updatemanager service periodically to check for updates. I implemented this by placing a version inside the processor primary assembly (in the form of an embedded resource). The update manager returns the current latest available version and download location. If this version is more recent than the built in version, the decision to update can be made.
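The comparison itself can be sketched as follows. This is only an illustration under assumptions: it presumes both versions are available as dotted version strings (how they are read from the embedded resource or returned by the update manager is not shown), and the names UpdateCheck and IsUpdateAvailable are hypothetical.

```csharp
using System;

// Hypothetical helper: decides whether the advertised version is newer than the built-in one.
public static class UpdateCheck
{
    public static bool IsUpdateAvailable(string builtInVersion, string latestVersion)
    {
        Version builtIn, latest;

        // Treat unparsable input as "no update" rather than throwing in a periodic check.
        if (!Version.TryParse(builtInVersion, out builtIn) ||
            !Version.TryParse(latestVersion, out latest))
        {
            return false;
        }

        // System.Version compares component by component, so 1.10.0 is newer than 1.9.0.
        return latest > builtIn;
    }
}
```

Using System.Version instead of a plain string comparison avoids the classic mistake of ordering "1.10.0" before "1.9.0".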

This completes the easy part.

Updating

The problem with updating a process in-place at runtime is that the operating system locks executable images (exe/dll) when they are mapped inside a running process. So when you try to overwrite them, you get ‘file is in use by another process’ errors. The natural approach would therefore be to unload every non-OS library except the executable itself, followed by the overwrite action and subsequent reload.

In fact this works for native code/processes, however managed assemblies once loaded can not be unloaded. It therefore appears we are out of luck and can’t use this method. However, we have a (brute force) escape: while we can’t unload managed assemblies, we can unload the whole AppDomain they have been loaded into.

Updating: managed approach

The idea therefore becomes to spin up the process with almost nothing in the default AppDomain (which can never be unloaded), and from there spawn a new AppDomain with the actual Processor code. If an update is detected, we can unload, update, and respawn it again.

And still it didn’t work… The problem I ran into now is that somehow the default domain persisted in loading one of the user defined assemblies. I loaded my new AppDomain with the following lines:

using System;
using System.Reflection;

public class Processor : MarshalByRefObject
{
    // Static, because it is assigned from the static HostInNewDomain method.
    static AppDomain _processorDomain;

    public void Start()
    {
       // startup code here...
    }

    public static Processor HostInNewDomain()
    {
        // Setup config of new domain to look at parent domain app/web config.
        var procSetup = AppDomain.CurrentDomain.SetupInformation;
        procSetup.ConfigurationFile = procSetup.ConfigurationFile;

        // Start the processor in a new AppDomain.
        _processorDomain = AppDomain.CreateDomain("Processor", AppDomain.CurrentDomain.Evidence, procSetup);

        return (Processor)_processorDomain.CreateInstanceAndUnwrap(Assembly.GetExecutingAssembly().FullName, typeof(Processor).FullName);
    }
}

and in a separate assembly:

public class ProcessorHost
{
    Processor _proc;

    public void StartProcessor()
    {
        _proc = Processor.HostInNewDomain();
        _proc.Start();
    }
}

There are several problems in this code:

  • the Processor type is used inside the default AppDomain in order to identify the assembly and type to spawn in there – this causes the assembly which contains the type to get loaded in the default domain as well.
  • after spawning the new AppDomain, we call into the Processor.Start() to get it going. For the remoting to work, the runtime generates a proxy inside the default domain to get to the Processor (MarshalByRefObject) in the Processor domain. It does so by loading the type from the assembly containing the Processor type and reflecting on that. I tried different approaches (reflection, casting to dynamic), but it seems the underlying mechanism to generate the proxy is always the same.

So what is the solution? For one, we can make it autostart by starting all the action in the constructor of the Processor. That way we don’t need to call anything to start the Processor, so the runtime doesn’t generate a proxy. Moreover, we can take a stringly typed dependency on the assembly and type. This changes the code above to:

public class Processor : MarshalByRefObject
{
    public Processor()
    {
        Start();
    }

    public void Start()
    {
        // startup code here....
    }
}

and in a separate assembly:

using System;
using System.Runtime.Remoting; // ObjectHandle

public class ProcessorHost
{
    private const string ProcessorAssembly = "Processor, Version=1.0.0.0, Culture=neutral, PublicKeyToken=null";
    private const string ProcessorType = "Processor.Processor";

    AppDomain _processorDomain;
    ObjectHandle _hProcessor;

    public void Start()
    {
        // Setup config of new domain to look at parent domain app/web config.
        var procSetup = AppDomain.CurrentDomain.SetupInformation;
        procSetup.ConfigurationFile = procSetup.ConfigurationFile;

        // Start the processor in a new AppDomain.
        _processorDomain = AppDomain.CreateDomain("Processor", AppDomain.CurrentDomain.Evidence, procSetup);

        // Just keep an ObjectHandle, no need to unwrap this.
        _hProcessor = _processorDomain.CreateInstance(ProcessorAssembly, ProcessorType);
    }
}

Communicating with the new AppDomain

Above I circumvented the proxy generation (and thereby type assembly loading in the default AppDomain) by kicking off the startup code automatically in the Processor constructor. However, this restriction introduces a new problem: since we cannot ever call into or out of the new domain by going through user defined types, as that would cause user defined assemblies to be locked in place, how then do we communicate to the parent/default domain an update is ready ?

For the moment I do this by writing AppDomain data in the Processor domain – AppDomain.SetData(someKey, someData) – and reading it periodically from the parent domain – AppDomain.GetData(someKey). It’s not ideal as it requires polling, but it at least works: I only use standard framework methods and types, and so the update works.
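A minimal sketch of this polling handshake could look like the code below. The key name, the class name UpdateSignal and the choice to pass the download location as the payload are all assumptions made for illustration. In the real setup the two halves would live in different assemblies (the writer in the Processor assembly, the reader in the host), each using the same string literal for the key, so that only framework types (a string) ever cross the domain boundary.

```csharp
using System;
using System.Threading;

// Hypothetical sketch of signalling "update ready" via AppDomain data.
public static class UpdateSignal
{
    // Assumed key name; both sides only need to agree on the string.
    public const string UpdateReadyKey = "Processor.UpdateReady";

    // Runs inside the Processor domain once a newer version has been detected.
    public static void SignalUpdateReady(string downloadLocation)
    {
        AppDomain.CurrentDomain.SetData(UpdateReadyKey, downloadLocation);
    }

    // Runs in the parent (default) domain: polls the Processor domain's data slot.
    // Returns the download location, or null if nothing was signalled in time.
    public static string WaitForUpdate(AppDomain processorDomain, TimeSpan pollInterval, int maxPolls)
    {
        for (int i = 0; i < maxPolls; i++)
        {
            var location = processorDomain.GetData(UpdateReadyKey) as string;
            if (location != null)
            {
                return location;
            }
            Thread.Sleep(pollInterval);
        }
        return null;
    }
}
```

When WaitForUpdate returns a location, the host can unload the Processor domain, overwrite the assemblies and spawn a fresh domain as described earlier.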

Down the rabbit hole: JIT Cache

Some weeks ago I was working on a heavyweight .NET web application and got annoyed by the fact that I needed to wait forever to see changes in my code at work. I don’t have a slow machine, and where the initial compilation of the .NET/C# was pretty fast, the loading – dominated by Just-In-Time compilation (JIT) – took forever. It routinely took 30-60 seconds to come up, and most of that time was spent on loading images and JITing them.

That the JIT can take time is a well known fact, and there are some ways to alleviate the burden. For one there is the Native Image Generator (NGen), a tool shipped with the .NET framework since v1.1. It allows you to pre-JIT entire assemblies and install them in a system cache for later use. Other more recent developments are the .NET Native toolchain (MSFT) and SharpLang / C# Native (community), which leverage the C++ compiler instead of the JIT compiler to compile .NET (store) apps directly to native images.

However great these techniques are, they are designed with the idea ‘pay once, profit forever’ in mind. A good idea for production releases, but they won’t solve my problem; if I change a single statement in my code and rebuild using these tools, it will increase the time I have to wait instead of reduce it due to the more extensive Common Intermediate Language (CIL) to native compilation.

The idea

An alternative for the scenario described above would be to keep a system global cache of JITted code (methods), only invoking the actual JIT for code that isn’t in the cache yet.
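To illustrate the idea only, here is a toy sketch of such a cache. Everything in it is an assumption: it is an in-memory dictionary rather than a system global store, it keys entries on a hash of the method's IL bytes (a real cache would also have to account for anything else that affects the generated code, such as referenced metadata and the runtime version), and the ‘compiler’ is just a delegate standing in for the actual JIT call. The name JitCacheSketch is hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;

// Toy illustration: only invoke the (stand-in) compiler for IL that was not seen before.
public sealed class JitCacheSketch
{
    private readonly Dictionary<string, byte[]> _nativeByIlHash = new Dictionary<string, byte[]>();
    private readonly Func<byte[], byte[]> _compile;

    // Number of times the stand-in compiler was actually invoked.
    public int CompileCount { get; private set; }

    public JitCacheSketch(Func<byte[], byte[]> compile)
    {
        if (compile == null) throw new ArgumentNullException("compile");
        _compile = compile;
    }

    public byte[] GetOrCompile(byte[] il)
    {
        if (il == null) throw new ArgumentNullException("il");

        string key = HashOf(il);
        byte[] native;
        if (!_nativeByIlHash.TryGetValue(key, out native))
        {
            // Cache miss: pay the compilation cost once and remember the result.
            native = _compile(il);
            CompileCount++;
            _nativeByIlHash[key] = native;
        }
        return native;
    }

    private static string HashOf(byte[] il)
    {
        using (var sha = SHA256.Create())
        {
            return Convert.ToBase64String(sha.ComputeHash(il));
        }
    }
}
```

With this shape, changing a single method only causes a miss for that method, which is exactly the behaviour wanted during an edit-rebuild-reload cycle.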

Requirements

  • Interceptor: a mechanism to hook calls from the Common Language Runtime (CLR) to the JIT compiler. We need this to be able to introduce our own ‘business’ logic (caching mechanism) in this channel.
  • Injector: a mechanism to load the Interceptor into a remote .NET process. We need this to hook every starting .NET program: most JIT compilation is done during startup, so loading the interception code should take place at that time for maximum profit.
  • Cache: the actual business logic. A smart cache keeping track of already JITted code and validity.

Note: in this article I’m going to discuss the first two, which are the technical challenge. The actual cache will be the topic of a future article, and in all honesty I’m not sure it’s going to work. The awesome involved in the first two items was worth the effort already.

 

Interceptor

In the desktop version of .NET, the CLR and JIT are two separate libraries – clr.dll and clrjit.dll/protojit.dll respectively – which get loaded in every .NET process. I started from the very simple assumption that the CLR calls into some function export of the clrjit library. When I checked the public exports of clrjit though, there were only 2:

[Image: the public exports of clrjit.dll]

I called getJit and got back something which was likely a pointer to some C++ class, but I was unable to figure out what to do with it, or to work out its methods and their required arguments, so my best guess was googling for ‘jit’, ‘getJit’, ‘hook’ etc.

I found a brilliant article by Daniel Pistelli, who identified the returned value from clrjit!getJit as an implementation of Rotor’s ICorJitCompiler. The Rotor project (officially: SSCLI) is the closest thing we non-MSFT people have to the sources of the native parts of the .NET ecosystem (runtime, JIT, GC). However, MSFT made it very clear it was only a demonstration project: it wasn’t the actual .NET source. Moreover, the latest release is from 2006. In his article, Daniel found that he could use the Rotor source headers to work with the production .NET version of the JIT: the vtable in the .NET desktop implementation is more extensive, but the first entries are identical.

For the full details I’ll refer you to his article, but once operational this is enough for us to intercept and wrap requests for compilation to the JIT compiler with our own function with the signature:

int __stdcall compileMethod(ULONG_PTR classthis, ICorJitInfo *comp, CORINFO_METHOD_INFO *info, unsigned flags, BYTE **nativeEntry, ULONG  *nativeSizeOfCode)
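To illustrate what ‘intercept and wrap’ means here, below is a minimal, portable sketch of swapping one vtable slot for a wrapper. It does not touch the real clrjit: FakeJit, FakeJitVtbl, realCompile, hookedCompile and installHook are hypothetical names, and the vtable is modeled as an explicit table of function pointers. In a real process that table typically sits in read-only memory, so the write would first need the page protection changed (for instance with VirtualProtect on Windows).

```cpp
#include <cassert>

struct FakeJit;
using CompileFn = int (*)(FakeJit* self, int methodToken);

// Stand-in for the vtable: a table of function pointers, slot 0 first.
struct FakeJitVtbl {
    CompileFn compileMethod;
};

// Stand-in for the object returned by getJit: its first field points to the vtable.
struct FakeJit {
    FakeJitVtbl* vtbl;
    int compiledCount;
};

// The 'real' JIT entry point in this model.
int realCompile(FakeJit* self, int methodToken) {
    ++self->compiledCount;
    return methodToken * 2;
}

static CompileFn g_original = nullptr;  // saved original slot
static int g_intercepted = 0;           // how often our wrapper ran

// Our wrapper: run our own logic first (a cache lookup would go here),
// then forward the request to the original compiler.
int hookedCompile(FakeJit* self, int methodToken) {
    ++g_intercepted;
    return g_original(self, methodToken);
}

// Save the original slot and point it at the wrapper. Installing twice
// is a no-op, otherwise the wrapper would end up calling itself.
void installHook(FakeJit* jit) {
    if (jit->vtbl->compileMethod == &hookedCompile)
        return;
    g_original = jit->vtbl->compileMethod;
    jit->vtbl->compileMethod = &hookedCompile;
}
```

Callers keep going through the same slot and see the same results; the only difference is that our code now runs first on every compilation request.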

Injector

Most JIT compilation takes place during process start, so to not miss anything, we have to find a way to hook the JIT in a very early stage. There are two methods I explored: for the first I self-hosted the CLR. Using the unmanaged hosting APIs it’s possible to write a native application in which you have much more control over process execution. For instance you can first load the runtime, which will automatically load the JIT compiler as well, insert your custom hook next, and only then start executing managed code. This will ensure you don’t miss a bit.

An example trace from a self-hosted CLR trivial console app with JIT hooking:

[Image: JIT hook log of the self-hosted console app]

Note that the JIT is hooked before managed execution starts.

However, this method has a downside, namely that it will only work for processes we start explicitly using our native loader. Any .NET executable started in the regular way will escape our hooking code. What we really want is to load our hook at process start in every .NET process. For this we need a couple of things:

  1. process start notifications – to detect a new process start
  2. remote code injection – to load our hooking code into the newly started process

It turns out both are possible, but to do so we have to dive into the domain of kernel mode and remote code injection. For fun, try entering those keywords in a search engine and see how many references to ‘black hat’, ‘malicious’, ‘rootkit’, ‘security exploits’ etc. you find. Clearly, the methods I want to use hold some attraction for a whole different kind of audience as well.

Anyway, I still want it so down the rabbit hole we go.

1. Process start notifications

We can register a method for receiving process start/quit notifications by calling PsSetCreateProcessNotifyRoutine. This function is part of the kernel mode APIs, and to access them we have to write a kernel driver. When you download and install the Windows Driver Kit, which integrates with Visual Studio, you get standard templates for writing drivers. I strongly advise you to use them: writing a driver from scratch is not especially hard, but it is very unforgiving, as any bug will bugcheck the machine (your ‘process’ is the kernel), so it is better to start from a tested scaffold. I tested the driver in a new VM, which turned out to be good foresight as I fried it a couple of times, making it completely unbootable.

Anyway, back to the code. To register the notification routine we have to call PsSetCreateProcessNotifyRoutine during Driver Entry:

HANDLE hNotifyEvent;
PKEVENT NotifyEvent = NULL;
unsigned long lastPid;

VOID NotifyRoutine(_In_ HANDLE parentId, _In_ HANDLE processId, _In_ BOOLEAN Create)
{
    UNREFERENCED_PARAMETER(parentId);

    // Process IDs arrive as HANDLE values; convert once so the format
    // specifier and the stored value agree on 32 and 64 bit builds.
    unsigned long pid = (unsigned long)(ULONG_PTR)processId;

    if (Create)
    {
        DbgPrint("Execution detected. PID: %lu", pid);

        if (NotifyEvent != NULL)
        {
            lastPid = pid;
            KeSetEvent(NotifyEvent, 0, FALSE);
        }
    }
    else
    {
        DbgPrint("Termination detected. PID: %lu", pid);
    }
}

VOID OnUnload(IN PDRIVER_OBJECT DriverObject)
{
    UNREFERENCED_PARAMETER(DriverObject);

    // remove notify callback
    PsSetCreateProcessNotifyRoutine(NotifyRoutine, TRUE);
}

NTSTATUS DriverEntry(_In_ PDRIVER_OBJECT DriverObject, _In_ PUNICODE_STRING RegistryPath)
{
    NTSTATUS status;

    // Create an event object to signal when a process started.
    DECLARE_CONST_UNICODE_STRING(NotifyEventName, L"\\NotifyEvent");
    NotifyEvent = IoCreateSynchronizationEvent((PUNICODE_STRING)&NotifyEventName, &hNotifyEvent);
    if (NotifyEvent == NULL)
        return STATUS_UNSUCCESSFUL;
    KeClearEvent(NotifyEvent);

    // boiler plate code omitted

    status = WdfDriverCreate(DriverObject, RegistryPath, &attributes, &config, WDF_NO_HANDLE);
    if (!NT_SUCCESS(status))
        return status;

    // Register the callback; bail out if the registration fails.
    status = PsSetCreateProcessNotifyRoutine(NotifyRoutine, FALSE);
    if (!NT_SUCCESS(status))
        return status;

    DriverObject->DriverUnload = OnUnload; // omitting this will hang the system
    return STATUS_SUCCESS;
}

You can see we also register a kernel event object which will be signaled every time a process start notification is received, as we will need this later.

Once process start notifications were going, I explored ways to also do part 2 (remote code injection) from kernel mode, but decided against it for two reasons. First, the kernel mode APIs, while offering some very powerful and low level access to the machine and OS, are very limited (you cannot access regular Win32 APIs), so it’s much easier and faster to develop in user mode. Second, I got bored of restoring yet another fried VM.

1b. Getting notifications to user mode

So I needed an always running component in user mode which communicates with the kernel mode driver: the ideal use case for a Windows service. By default kernel driver objects aren’t accessible in user mode. To access one you have to expose it in the (kernel) object directory as a ‘Dos Device’. Adding a symbolic link like \DosDevices\InterceptorDriver to the actual driver object – using WdfDeviceCreateSymbolicLink – is sufficient to access it by name in user mode (full path: \\.\InterceptorDriver).

Just open it like a file:

HANDLE hDevice = CreateFileW(
    L"\\\\.\\InterceptorDriver", // driver to open
    0,                           // no access to driver
    FILE_SHARE_READ | FILE_SHARE_WRITE, // share mode
    NULL,                        // default security attributes
    OPEN_EXISTING,               // disposition
    0,                           // file attributes
    NULL);                       // do not copy file attributes

For the actual communication the preferred way is using IOCTL: in user mode you can send an IO control code to the driver:

DWORD pId;
DWORD junk = 0;
BOOL bResult = DeviceIoControl(hDevice, // device to be queried
               IOCTL_PROCESS_NOTIFYNEW, // operation to perform
               NULL, 0,                 // no input buffer
               &pId, sizeof(pId),       // output buffer
               &junk,                   // # bytes returned
               (LPOVERLAPPED)NULL);     // synchronous I/O

The driver itself has to handle the code:

VOID InterceptorDriverEvtIoDeviceControl(_In_ WDFQUEUE Queue, _In_ WDFREQUEST Request, _In_ size_t OutputBufferLength, _In_ size_t InputBufferLength, _In_ ULONG IoControlCode)
{
    NTSTATUS status = STATUS_SUCCESS;
    size_t bytesReturned = 0;
    switch (IoControlCode)
    {
    case IOCTL_PROCESS_NOTIFYNEW:
    {
        if (NotifyEvent == NULL)
            break;

        // Set a finite timeout to allow service shutdown (else thread is stuck in kernel mode).
        LARGE_INTEGER timeOut;
        timeOut.QuadPart = -10000 * 1000; // 1 second in 100 ns units; negative means relative to now.
        status = KeWaitForSingleObject(NotifyEvent, Executive, KernelMode, FALSE, &timeOut);

        if (status == STATUS_SUCCESS)
        {
            unsigned long * buffer;
            // Report a failure to get the output buffer instead of silently returning 0 bytes.
            status = WdfRequestRetrieveOutputBuffer(Request, sizeof(lastPid), (PVOID *)&buffer, NULL);
            if (NT_SUCCESS(status))
            {
                *buffer = lastPid;
                bytesReturned = sizeof(lastPid);
            }
        }
        break;
    }
    default:
        break;
    }
    WdfRequestCompleteWithInformation(Request, status, bytesReturned);
    return;
}

and in the driver IO queue setup register this method:

NTSTATUS InterceptorDriverQueueInitialize(_In_ WDFDEVICE Device)
{
    WDFQUEUE queue;
    NTSTATUS status;
    WDF_IO_QUEUE_CONFIG queueConfig;

    PAGED_CODE();

    WDF_IO_QUEUE_CONFIG_INIT_DEFAULT_QUEUE(&queueConfig,WdfIoQueueDispatchParallel);
    queueConfig.EvtIoDeviceControl = InterceptorDriverEvtIoDeviceControl;
    queueConfig.EvtIoStop = InterceptorDriverEvtIoStop;

    status = WdfIoQueueCreate(Device,&queueConfig,WDF_NO_OBJECT_ATTRIBUTES,&queue);
    if( !NT_SUCCESS(status) )
    {
        TraceEvents(TRACE_LEVEL_ERROR, TRACE_QUEUE, "WdfIoQueueCreate failed %!STATUS!", status);
        return status;
    }
    return status;
}

The mechanism we have here is a sort of ‘long-polling’ of the kernel driver: the service sends an IOCTL code to the driver, and the driver pauses the thread on an event which is signaled every time a process is started. Only then does the thread return to user mode, with the ID of the process in its output buffer. To allow for Windows service shutdown, it’s advisable to wait for the event with a timeout (and poll again if it returned due to this timeout), otherwise the thread will be stuck in kernel mode until you start one more process – making service shutdown impossible.
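The pattern itself can be tried out without a driver. The sketch below is a user-mode stand-in built on standard C++ only, so it shows the long-poll-with-timeout loop and nothing else: ProcessNotifier, waitForStart, serviceLoop and relativeTimeout100ns are hypothetical names, and a condition variable plays the role of the kernel event.

```cpp
#include <atomic>
#include <cassert>
#include <chrono>
#include <condition_variable>
#include <cstdint>
#include <mutex>

// Kernel waits take a relative timeout as a negative count of 100 ns units.
constexpr std::int64_t relativeTimeout100ns(std::int64_t milliseconds) {
    return -(milliseconds * 10000);
}
static_assert(relativeTimeout100ns(1000) == -10000 * 1000, "one second, as in the driver");

// Stand-in for the driver side: an auto-reset event plus the last PID.
class ProcessNotifier {
public:
    // Plays the role of NotifyRoutine: store the PID and signal the event.
    void signal(unsigned long pid) {
        {
            std::lock_guard<std::mutex> lock(mutex_);
            lastPid_ = pid;
            signaled_ = true;
        }
        changed_.notify_one();
    }

    // Plays the role of the IOCTL: block until a process start or the timeout.
    // Returns false on timeout, so the caller can decide to poll again.
    bool waitForStart(std::chrono::milliseconds timeout, unsigned long& pid) {
        std::unique_lock<std::mutex> lock(mutex_);
        if (!changed_.wait_for(lock, timeout, [this] { return signaled_; }))
            return false;
        signaled_ = false;  // auto-reset, like a synchronization event
        pid = lastPid_;
        return true;
    }

private:
    std::mutex mutex_;
    std::condition_variable changed_;
    bool signaled_ = false;
    unsigned long lastPid_ = 0;
};

// Service side: keep polling until asked to stop. A timeout only means
// "nothing happened, check the stop flag and wait again".
template <typename Handler>
void serviceLoop(ProcessNotifier& notifier, const std::atomic<bool>& stop, Handler onStart) {
    unsigned long pid = 0;
    while (!stop.load()) {
        if (notifier.waitForStart(std::chrono::milliseconds(50), pid))
            onStart(pid);
    }
}
```

With a finite timeout the loop looks at the stop flag at least once per interval, which is exactly what makes a clean service shutdown possible. Like the driver, this stand-in keeps only the most recent PID, so two process starts in quick succession can overwrite each other.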

2. Remote code injection

We are back in user mode now, and we can run code once a process starts. The next step is to somehow load our JIT hooking code in every new (.NET) process, and make it start executing. There are a couple of ways in which you can do this, and most involve hacks around CreateRemoteThread. This Win32 function allows a process to start a thread in the address space of another process. The challenge is how to get the process to load our hooking code. There are 2 approaches which both require writing into the remote process memory before calling CreateRemoteThread:

  • write the hooking code directly in the remote process, and call CreateRemoteThread with an entry point in this memory
  • compile our hooking code to a dll, and only write the dll name to the remote process memory. Then call CreateRemoteThread with the address of kernel32!LoadLibrary with its argument pointing to the name

As I want to be able to hook the JIT in 32 as well as 64 bit processes, I have to compile 2 versions of the hooking code anyway. For the sake of code modularity and separation of concerns I opted for the second way, so the simple recipe I took is:

  • A. Write a dll which on load executes the hooking code, and compile it in 2 flavors (32/64 bit).
  • B. In the Windows service, on process start notification, use CreateRemoteThread + LoadLibrary to load the correct flavor of the dll in the target.

A. Auto executing library

This is quite easy, but you have to beware the dragons. A dll has a DllMain entry point with signature:

BOOL APIENTRY DllMain( HMODULE hModule, DWORD ul_reason_for_call, LPVOID lpReserved); 

This entry point is called when the dll is first loaded or unloaded, or on thread creation/destruction; which of these applies is specified in ul_reason_for_call. The thing to beware of is written in the Remarks section: “Access to the entry point is serialized by the system on a process-wide basis. Threads in DllMain hold the loader lock so no additional DLLs can be dynamically loaded or initialized.”. In other words: you cannot load a library in code that runs in the context of DllMain.

Why is this a problem for us? The hooking code has to query the .NET shim library (mscoree.dll) to find out if and which .NET runtimes are loaded in the process. Since there is no a priori way to know for sure the shim library is already loaded when we try to get a module handle, our hooking code may trigger a module load and so a deadlock.

The fix is easy: just start a new thread in the DllMain entrypoint and make that thread query the shim library. This thread will start execution outside the current loader lock.

B. CreateRemoteThread + LoadLibrary

I will skip over most details here as they are described in great detail in various articles. However, there are some things to beware of when cross injecting from the 64 bit service to a 32 bit process. The steps in the ‘regular’ procedure are:

  1. Get access to the process
  2. Write memory with name of hooking Dll
  3. Start remote thread with entrypoint kernel32!LoadLibrary
  4. Free memory

Most of these are straightforward, but there is a problem in cross injecting in step 3, and more specifically in finding the exact address to call.

When injecting in a same bitness architecture, this is easy as we can use a trick: kernel32 is loaded into every process at the same virtual address. This address can change, but only at reboot. Using this trick, we can:

  1. Get the module handle (virtual address) of the kernel32 module in the injecting process – it will be identical in the remote process
  2. Call kernel32!GetProcAddress to find the address of LoadLibrary

When injecting cross bitness, we have 2 problems: the kernel32 loading address is different for 64 and 32 bit, and we cannot use kernel32!GetProcAddress on our 64 bit kernel32 module to find the address in the 32 bit one. To fix this, I replaced the steps above for this scenario with:

  1. Use PSAPI and call EnumProcessModulesEx on the target process, with the explicit LIST_MODULES_32BIT option (there are also 64 bit modules in a 32 bit process, go figure), get their names (GetModuleBaseName) to find kernel32, and when found get the module address from GetModuleInformation
  2. Use ImageHlp’s MapAndLoad and extract the header information from the PE header of the 32 bit kernel32. Find the export directory and together with the name directory find the RVA of LoadLibrary ourselves (Note: the RVAs in the PE are the in-memory RVAs. The on-disk layout of a PE is different; you can use the section headers to correlate the two). Add this RVA to the module address from step 1 to find the VA of kernel32!LoadLibrary
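The arithmetic in step 2 can be sketched in portable C++ on simplified data. This is not the code used in the service: Section and ExportTable are hypothetical, trimmed-down views of the PE section headers and export directory (field comments name the corresponding PE fields), and the values used to exercise them would be made up rather than read from a real kernel32. Also note that LoadLibrary is a macro in the Windows headers; as far as I know the names that actually appear in the export table are LoadLibraryA and LoadLibraryW.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Trimmed-down view of one section header.
struct Section {
    std::uint32_t virtualAddress;    // VirtualAddress: RVA of the section in memory
    std::uint32_t virtualSize;       // VirtualSize: size of the section in memory
    std::uint32_t pointerToRawData;  // PointerToRawData: offset of the section on disk
};

// Translate an in-memory RVA to an offset in the file as it lies on disk.
// Returns false if the RVA does not fall inside any section.
bool rvaToFileOffset(const std::vector<Section>& sections, std::uint32_t rva,
                     std::uint32_t& offset) {
    for (const Section& s : sections) {
        if (rva >= s.virtualAddress && rva - s.virtualAddress < s.virtualSize) {
            offset = rva - s.virtualAddress + s.pointerToRawData;
            return true;
        }
    }
    return false;
}

// Trimmed-down view of the three parallel tables in the export directory.
struct ExportTable {
    std::vector<std::string> names;        // AddressOfNames, resolved to strings
    std::vector<std::uint16_t> ordinals;   // AddressOfNameOrdinals: index into functions
    std::vector<std::uint32_t> functions;  // AddressOfFunctions: RVA per export
};

// Name i maps to ordinals[i], which in turn indexes the function RVA table.
bool findExportRva(const ExportTable& table, const std::string& name, std::uint32_t& rva) {
    for (std::size_t i = 0; i < table.names.size() && i < table.ordinals.size(); ++i) {
        if (table.names[i] == name) {
            const std::uint16_t index = table.ordinals[i];
            if (index >= table.functions.size())
                return false;
            rva = table.functions[index];
            return true;
        }
    }
    return false;
}

// The address to hand to CreateRemoteThread: module base in the target plus the RVA.
std::uint64_t remoteAddress(std::uint64_t moduleBase, std::uint32_t rva) {
    return moduleBase + rva;
}
```

The extra hop through the ordinal table is easy to miss: the position of a name in the name table is not the position of its address in the function table.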

Working setup

A DbgView of the loading and injection in both flavors of .NET processes (32 and 64 bit):

[Image: DbgView output showing successful injection]

 

Note: I strive to put the full code out there eventually. But it may take some time.