NWScript JIT engine: IR instruction raising

August 14th, 2010

Previously, we learned about the high-level structure of the NWScript IR. This time, we’ll take a closer look at the IR raising phase.

The IR raising process is the domain of the NWScript analyzer component. The IR raising sequence has several distinct phases, each of which makes one pass through the logical structure of a script program:

  1. A structural analysis phase, which discovers all subroutines in the program and their parameter and return value sizes (in stack cells). The structural analysis phase also produces control flow objects as branches and branch targets are encountered. Any STORE_STATE resume labels are converted to first-class subroutines during structural analysis as well (with locals inherited from the parent subroutine promoted to parameters). We covered most the structural analysis phase last post for the most part.
  2. A code analysis phase. The code analysis phase performs the bulk of the remaining IR raising work, including creating variable objects for each distinct stack location (at a given control flow point), discovering and propagating the types of variables, discovery and creation of global variables, and actual IR generation.
  3. A post-processing phase. Unlike the other two phases, the post-processing phase operates primarily on the IR and not the NWScript instruction stream. The post-processing phase performs basic optimization on the emitted IR, such as folding of temporary variables generated by NWScript stack operations into their parent variables. Additionally, it generates various hints to JIT backend, such as whether an IR variable is written to exactly once, or whether an IR variable is written to but never read from.

At this point in the project, we executed a divide and conquer strategy; Justin wrote the code analysis and post-processing phases of the NWScript analyzer (and restructured much of what I had written for the structural analysis phase) once we had come up with a design, while I went on to implement the MSIL (.NET) JIT backend that consumed the IR.

I’ll let Justin delve into the full details of the code analysis and post-processing phases and just provide a quick summary here.

Variable stack

The code analysis phase traverses the NWScript instruction stream while maintaing a logical variable stack, that maps NWScript stack locations at the current program counter to IR-level variables. The variable stack facilitates “de-stacking” of NWScript stack variables, plus lifetime management for these variables (once a variable is popped off of the variable stack, the code analyzer knows that its logical lifetime has ended, thus allowing this information to be conveyed in the form of an emitted I_DELETE IR instruction).

Parameter and return value variables can simply be placed on the variable stack (and appropriately flagged) when analysis begins at a particular subroutine; because the structural analysis phase has discovered the number of individual stack cells devoted to parameters and return values for each subroutine, tracking of these constructs within the code analyzer simply boils down to creating untyped variables on the variable stack before NWScript instruction in a particular subroutine is analyzed.

Type-equivalence chains

The code analysis phase discovers variable types by linking variables that can be inferred to have a similar type together into a type equivalence chain. (Type equivalence chain links may be formed by operations such as a subroutine call, or an assignment, where both sides of the operation are known to yield the same type, though that type might not be immediately known.)

Initially, the type equivalence chain may have an indetermine type, but once a typed stack reference is made to any variable in a chain, the types of all associated variables become known. (Only variables that are copied on the NWScript stack but never referenced for a specific operation can remain untyped after code analysis completes when using this system. Since these variables are, by nature, extraneous, leaving them as untyped does not present a problem to program correctness.)

The type-equivalence chain system can be used for strong type discovery because the NWScript stack itself is strongly typed; any time an operation will actually be performed on a variable, its type is specified in the instruction operands embedded in the instruction stream. Thus, once the type of any variable in a type equivalence chain has been uncovered, the types of all other variables are instantly known. (One exception to this is struct/struct comparisons, which are simply a comparison over a range of stack space without a specific type provided. Other references to all variables involved will always have resolved the types involved unless one of the structs was passed as a parameter to the entry point symbol; entry point symbols do not support struct parameters though, precluding that from becoming a problem.)

IR instruction generation

IR instruction generation in the code analysis phase operates by decoding the current NWScript instruction, picking up the actual IR-level variable operands from the variable stack, and creating an IR-level instruction object linking the variables together with the IR instruction descriptor.

Some NWScript instructions may turn into multiple IR instructions (such as a struct/struct NWScript-level OP_EQUAL/OP_NEQUAL comparison instruction, which must produce I_LOGOR/I_LOGAND instructions to combine the result of each successive I_EQUAL/I_NEQUAL comparision in order to form the final result).

Post-processing phase

Similarly, the post-processing phase walks the generated, now functional IR and attempts to fold extraneous temporaries into their parent variables (among other optimization-related tasks). Temporary folding is useful because the raw translation of NWScript instructions to IR-level instructions yields numerous temporaries (reflections of NWScript instructions copying data to the top of the stack for a particular operation); these temporaries can be proven to be removable in many cases. While the JIT backend generally emits code to a system with its own embedded optimizer, eliminating these temporaries reduces the size of the IR (and the output of a non-optimizing JIT backend).

It’s important to note that this overview is a simplification; there are some tricky cases that need special handling, such as merging two control flows (and their associated variables) together.

Once the IR is produced, the JIT backend takes over. The JIT backend is handed a completed IR for the program to JIT; it then assumes responsibility for lowering the NWScript IR into the native form emitted by the JIT backend.

Coming up: Examining the MSIL (.NET) JIT backend in more detail.

NWScript JIT engine: Intermediate Representation (IR) design

August 11th, 2010

Last time, I discussed how parameterized scripts turn out to actually work in NWScript, plus I delved into some details as to how the NWScript analyzer performs its first analysis pass, the structural analysis phase.

Although the NWScript VM need not go further than the structural analysis phase in order to correctly execute parameterized scripts[1], if we were to wish to create a JIT system for NWScript that worked to any degree of efficiency, it’s necessary to perform some more advanced analysis. This additional analysis is needed in order to “flatten” the script into a set of distinct variables that are manipulated by a fundamental instruction set.

The result of the structural analysis phase and the additional analysis phase (which I term the code analysis phase) is the NWScript IR, or intermediate representation. The intermediate representation (IR for short) is a set of data structures that describe the entire functional behavior of a script program in a format that is easy to translate into other forms.

Because the JIT system is designed to be easily extensible to multiple backends, the NWScript analyzer’s analysis phases do not directly emit any machine instructions; it is the job of the JIT backend to translate the resultant IR into a JIT-specific program representation. For the purposes of this post, we’ll focus on the IR itself.

The NWScript IR is designed with several principles in mind:

  • The IR should be designed in such a way that it’s easy to consume from a variety of code generation backends.
  • Each IR instruction should (generally) be independent of any previous instructions. That is, one should not need to track an inordinate amount of state specific to simply processing the IR in order to process each IR instruction. (As you’ll find out, this guideline is partially violated in two occasions for simplicity’s sake, but it generally holds true.) This implies that a non-optimizing backend should generally be able to consume the IR in more or less a single pass, without requiring significant postprocessing work.

With these loose characteristics in mind, let’s look at the basic layout of the IR. The IR consists of a set of in-memory C++ data structures (currently, no serialization format is designed; the backend must be able to directly interface with C++ data structures). These data structures encapsulate the following core concepts:

  • A NWScriptVariable, describing a strongly typed symbolic location in which data can be retained. Unlike the NWScript instruction set’s stack-based system, variables can be referenced freely. Variables can be marked as constant, and can fall under various classifications (locals, globals, parameters, return values, call site (outgoing) parameters, call site (outgoing) return values, etc.) Locals have well-defined lifetimes that are bounded by a series of specific instructions, whereas other variable classes have other, generally obvious lifetime semantics.
  • A NWScriptInstruction, which contains an underlying IR opcode, plus data and variable linkages that form the operands of the instruction. Not all IR instructions are direct analogs of NWScript instructions; some instructions have multiple mappings, or no direct mappings at all.
  • A NWScriptControlFlow. The control flow object represents a contiguous span of IR instructions that are neither the target of a branch, nor the source of a branch. From the IR perspective, a control flow contains a list of IR instructions, a description of how the control flow terminates (i.e. fallthrough, unconditional branch, conditional brach, end of function, merge, etc.), and a list of links to up to two successor control flows. (The successor links point to both forks of a conditional control flow transfer, or one of the links points to the fall through / unconditional endpoint of an unconditional control flow transfer.)
  • A NWScriptSubroutine, representing an IR subroutine. An IR subroutine can represent either a full-fleged NWScript subroutine, or the resume label of a STORE_STATE instruction. In either case, a subroutine groups together a list of control flows, strongly typed parameter variables, strongly typed return types, and several other pieces of information (such as a name, if we could learn the name of a subroutine from a debug symbol file for the script we are analyzing, should we have been able to locate symbols (not required)).

There are several other minor data structures which I will not describe here, as they are not particularly important for understanding the design of the IR. Additionally, various other attributes are associated with these above data items (such as a list of parent links for the control flow object), but these are similarly not required for a basic understanding.

The IR instruction set itself consists of the following operations:

typedef enum _INSTR
{
  I_CREATE,       // Create variable with type (no value)
  I_DELETE,       // Delete variable
  I_ASSIGN,       // Copy (assign) variables
  I_JZ,           // Jump if zero
  I_JNZ,          // Jump if not zero
  I_JMP,          // Jump unconditionally [deprecated]
  I_CALL,         // Call subroutine
  I_RETN,         // Return from subroutine
  I_ACTION,       // Call script action
  I_SAVE_STATE,   // Save state for script situation
  I_LOGAND,       // Logical AND (&&)
  I_LOGOR,        // Logical OR (||)
  I_INCOR,        // Bitwise OR (|)
  I_EXCOR,        // Bitwise XOR (^)
  I_BOOLAND,      // Bitwise AND (&)
  I_EQUAL,        // Compare (==)
  I_NEQUAL,       // Compare (!=)
  I_GEQ,          // Compare (>=)
  I_GT,           // Compare (>)
  I_LT,           // Compare (<)
  I_LEQ,          // Compare (<=)
  I_SHLEFT,       // Shift left (<<)
  I_SHRIGHT,      // Shift signed right (>>, SAR)
  I_USHRIGHT,     // Shift unsigned right (>>)
  I_ADD,          // Add (+), concatenate strings
  I_SUB,          // Subtract (-)
  I_MUL,          // Multiply (*)
  I_DIV,          // Divide (/)
  I_MOD,          // Modulus (%)
  I_NEG,          // Negation (-)
  I_COMP,         // Complement (~)
  I_NOT,          // Logical NOT (!)
  I_INC,          // Increment
  I_DEC,          // Decrement
  I_TEST,      // Set zero/not zero based on variable value
  I_INITIALIZE,   // Set variable to default value

  LASTINSTR
} INSTR, * PINSTR;

typedef const enum _INSTR * PCINSTR;

If you’ve taken a look at Tim Smith’s NWScript instruction set reference, many of the instructions should look familiar to you, in the form of a condensed set of the fundamental NWScript instruction opcodes. However, there are several new instructions, and as I had mentioned previously, you may notice that some NWScript instruction opcodes have no direct equivalents (such as the purely stack manipulation instructions).

Most instructions are relatively straightforward to understand from the comments above, so I will not describe each and every one of them; additionally, for most (data-processing) instructions, a uniform API is provided to retrieve the input/output variable objects affected, making code generation simple.

Several instructions are, however, more structural than data-processing in nature and bear further explanation; these have been highlighted in blue.

The I_CREATE and I_DELETE instructions bound the lifetime of a local variable. When an I_CREATE is issued, a variable is activated for use until an I_DELETE is encountered. These are provided as hints to the backend as to when a variable need remain active. These instructions are loose exceptions to the general design guideline that an instruction should not involve previous state, and can be thought of as optimization hints (versus creating all local variables up front from a code generation perspective).

I_SAVE_STATE represents the operation of the STORE_STATE and STORE_STATEALL NWScript instructions; it supplies a list of local and global variables used by the resume point of a STORE_STATE instruction, as well as a subroutine descriptor describing the resume point’s IR-level instruction code. (The analyzer raises a STORE_STATE resume point into a first class subroutine, with all local variables inherited from the ‘parent’ subroutine having been turned into parameters.) This allows the backend to easily generate code to save the required locals away at runtime for the next operation that consumes a saved state.

I_TEST is used by the conditional jump instructions. As a special exception to the rule that one instruction does not depend on the state of another, the I_TEST instruction must be immediately followed by a conditional jump (I_JZ or I_JNZ) instruction. Its purpose is to take an integer-typed variable operand and make it into a Boolean / logical 1 or zero value for use in the immediately following conditional jump. The purpose of a separate I_TEST instruction is to prevent temporary variables generated by a conditional operation from having to always be bifurcated across both control flow forks in the IR.

I_INITIALIZE supplies a hard initialization sequence point for a variable at which a variable must be assigned its type-specific default value. This IR instruction fulfills two purposes; it allows the backend to avoid emitting non-useful reinitialization code if the backend’s representation of variables are reused in a variable pool of some sort, and it allows the JIT backend to circumvent an issue with the IR raising phase relating to merging of variables across flows in some circumstances. (It is probable that the I_INITIALIZE instruction will go away and turn into simply a flag on a variable object once the merging problem I alluded to is no longer an issue. Justin tells me that he’s planning on writing a post describing the variable merging issue I mentioned, so I’ll let him provide the details there.)

Fear not, there is yet even more NWScript discussion upcoming; stay tuned…

[1] Actually, to be pedantic, there was a rather ugly hack I had inserted in the VM that manages to execute parameterized scripts with wrong parameter counts so long as they do not return a value. In the case of NWN2, this is a particularly important (and the most likely to be hit) case. However, this hack is obsolete with the structural analysis phase providing data on the number of parameters that the entry point symbol takes, so I will not detail the hack here.

NWScript JIT engine: Parameterized script entry point invocation and subroutine structural analysis

August 9th, 2010

Previously, I discussed how subroutine calls and extension (action service handler) calls are made in the NWScript instruction set. This time, we’ll dive in deeper to a feature that NWN2 added to NWScript: Parameterized entry point invocation, or the capability to pass parameters to a script’s main subroutine.

Originally, in the NWN1 days, NWScript did not support providing arguments to a script’s entry point symbol. Instead, when data needed to be made available to a script (say, a script that would be fired when a particular event occured), a special purpose action service handler would be created that would return the parameter to the script program. For example, a script might be called when a player attempts to open a door, and the script might need to deal with both the player object and the door being opened. To handle this case, the designers of NWN1 would create special action service handlers that would return these sorts of event parameters (see an example).

This approach has some advantages and disadvantages. On the upside, it’s very easy to expose new parameter information to scripts called for specific events. On the downside, however, this model is not very intuitive to script authors; the result is that NWN1 (and NWN2 in many cases due to backwards compatibility with NWN1) has a number of very specialized ‘get event information’ action service handlers that can be called meaningfully only at specific points in time. Figuring out which event handler is associated with which dedicated action service handler has been the source of innumerable bugs in NWN scripts.

In NWN2, the game designers added the capability to pass parameters directly to the entry point of a NWScript program, like any other NWScript subroutine, as a mechanism for passing information to script events. However, they had several constraints in doing so:

  • It should be easy to add new parameters without breaking already compiled scripts.
  • Significant changes to the compiler and execution environment should be avoided to the degree possible.

To that end, the NWN2 designers came up with a solution that required no changes to the NWScript compiler toolchain: They kept the existing calling convention for script subroutines and simply used it for the entry point symbol.

While this is an easy solution from the compiler point of view, it creates some unfortunate complications for the execution environment. Specifically, the entry point symbol is not actually the first bit of code executed by a NWScript program. Instead, there are two compiler-emitted stubs that are produced before the entry point:

  1. First, a #loader routine runs. The #loader routine’s purpose is to allocate storage for the entry point symbol’s return value, if the entry point had one, and then call it. (Remember that the caller must allocate storage for the return value of a script subroutine.)
  2. Second, if the script uses global variables, a #globals routine is called by #loader instead of the entry point. The #globals routine establishes a local variable frame containing all global variables in the script, then assigns the auxiliary SP pointer (BP) to its current SP value, before finally invoking the real entry point symbol.

If you’re paying attention, you may begin to see the problem that will arise here. Because the entry point symbol isn’t the first symbol called when the script is invoked, the execution environment can’t simply put the parameters on the stack before calling into the script the first time.

Why is this the case, you might ask? There are several issues here:

First, if the script returned a value, the return value will be pushed by #loader before the user’s entry point is called. This means that the return value of the entry point would appear to be a parameter to the entry point (recall that the return value should is the last item on the stack when calling a subroutine, and it is allocated by the caller of a subroutine).

Secondly, if the script program used global variables, were the execution environment to push parameters before first entering the script, the parameters would take the place of several global variables, and some of the global variables would turn into parameters to the script entry point.

In either case, we clearly have what might technically be described as a train wreck; the script program would behave wildly incorrectly if #loader and #globals were not modified to be aware of parameters passed to the entry point symbol.

While this could have been done, the designers of NWN2 elected not to do so. (Even if these two compiler-generated stubs were modified, the calling convention for a particular (event handler) script’s entry point would become fixed for all time — it would not be possible to add another parameter to the entry point symbol without breaking already compiled scripts.)

Instead, a series of modifications were made to the execution environment itself to cope with the fact that the compiler toolchain wasn’t updated, and the fact that the script might be designed to be called with a different count of parameters.

Specifically, when a parameterized script is going to be invoked, the execution environment must perform the following steps:

  1. The actual number of parameters that the entry point symbol is determined, and the number and types of parameters passed to the script are adjusted accordingly. In the case of the stock NWN2 implementation, this is accomplished by requiring the source text for a parameterized script to be available; the source text compiler is then used to discover the parameter counts and types to the entry point symbol.
  2. The execution environment arranges for parameters to be pushed after the entry point symbol is first branched to by watching for either #loader or #globals (which have several uniquely recognizable characteristics) to complete.

The first requirement, in particular, is a highly undesirable and onerous one (a compiled instruction set language should not require the execution environment to possess the source text).

For my implementation of a NWScript VM (and later, the JIT engine), I solved the first problem with a different approach (the second problem continues to be addressed by watching for #loader or #globals to finish execution within my VM):

Instead of relying on embedding a source compiler in my VM, I tagged the parameters to the entry point symbol as having an indetermine type on the VM stack. NWN2 defines conversion rules from strings to script parameters, as in some cases parameterized scripts are invoked from a list of strings. Finally, once an indetermine-typed parameter is referenced on the stack with, it is simply changed into the type that it was referenced as (since all stack references must be strongly typed, there is no ambiguity other than for a bulk copy operation, which doesn’t require a specific known type).

This doesn’t solve the problem of determining the count of parameters to the script, though. For that, my implementation performs a basic structural analysis phase of the script program instruction set. (The structural analysis phase — a subset of the NWScript Analyzer component — will also become critical to the JIT engine itself, as we’ll see later.)

The structural analysis operates as a static analysis step, wherein the flow of each call instruction in a script program is followed. This results in the discovery of several useful pieces of information:

  1. All subroutines in the compiled script program are discovered by walking through the instruction stream.
  2. The count of parameters and return values to each subroutine in the script program are discovered (i.e. the stack displacement of each subroutine across a call site).

The core of the structural analysis phase relies on the following properties of the NWScript instruction set:

  1. All references to the stack in the NWScript instruction set must specify a displacement that is a compile-time constant. In particular, this means that constructs such as the creation of a variable amount of locals within a subroutine cannot be directly encoded into the instruction set[1].
  2. All subroutines within the NWScript program follow the standard calling convention (except for #loader and #globals, which can be identified by virtue of always being the first subroutine invoked, or being the only subroutine to establish a global variables frame with the special purpose SAVEBP instruction). Notably, this means that if a subroutine is called with P parameter cells on the stack and R return cells on the stack, that on return from the subroutine, there will have been a write to a stack location R cells down from the top of the stack, and P cells will have been removed from the top of stack relative to subroutine entry.
  3. No subroutine in the script program is unconditionally recursive without any other exit path (such a subroutine would make little sense in practice, so this is seen as a reasonable limitation). This is important, because analyzing the structural (stack displacment) properties of a subroutine requires tracing through to the end of the subroutine in at least one flow path successfully.
  4. The structural analyzer must be provided with data on the stack displacement properties of all action service handlers. This is considered an acceptable limitation as the VM and the execution environment must be used in conjuction with action service handler implementations anyway. In this particular case, a program to designed to perform a very basic parse of the prototypes of the action handlers exposed by NWN and NWN2 was used to generate the data tables for those games.

Given these properties, it becomes possible to construct an algorithm to discover the structural properties (that is, the location, parameter count, and return count) of each subroutine in the script program. The algorithm can be described roughly as follows (excluding #loader and #globals which can be special cased safely):

  1. At the start, set virtual SP=0.
  2. Walk one instruction at a time forward via disassembly.
  3. If the instruction references the stack…
    • If the current instruction references the stack at a greater displacement than the previous greatest displacement within this subroutine, record the new displacement.
    • If the current instruction pointer modifies the stack, update the current virtual SP appropriately.
    • If the current instruction is a branch instruction, and we have not yet visited that branch, mark the branch as visited and queue a traversal of the branch (start at step #2 with the virtual SP at the time when the branch instruction was encountered). If the branch was unconditional and we had already visited the branch, switch to the next analysis queue entry.
  4. If the current instruction is an action service handler call (ACTION), update the virtual SP based on the action service handler data table.
  5. If the current instruction is a subroutine call…
    • If the subroutine has not been analyzed yet, then queue a new analysis run (at step #1) for the new subroutine, queue a new analysis run for the current call site address and mark it as blocking on the new subroutine (taking the current virtual SP), and then switch to processing the next unanalyzed branch that is not blocking on a subroutine. If there are no unanalyzed branches, we have detected a case of unconditional recursion and analysis fails.
    • If the subroutine has been analyzed already, then adjust the virtual SP accordingly.
  6. When a RET instruction is encountered, record the largest SP offset encountered and record the displacement of the virtual SP at the return point from 0. If virtual SP is 0, then there were no parameters. If virtual SP is -1 cell, then there was 1 cell of parameters, etc. Additionally, if the largest SP offset was greater than the displacement of the virtual SP from 0, then the remainder is the count of return cells for the subroutine. Finally, once all the above tasks are completed when the RET instruction is observed, switch to the next analysis queue entry, or if we have none left, analysis is complete.

This algorithm, repeated iteratively, discovers the structure of the script program. With the structure of the program discovered, the VM execution environment can fulfill the task of invoking parameterized scripts without requiring a source text compiler on-board with it (a notable improvement over the game’s stock execution environment).

Next time, we’ll look more at the overall design of the NWScript analyzer and the IR that it produces. (If you are wanting to sneak an early look, Justin has a post up describing some more details about the analyzer.)

[1] It is possible to construct a program that accesses SP from a variable offset through creative use of the SAVEBP/RESTOREBP instructions in a hand-assembled script due to a quirk of the stock game VM, which saves the previous BP on the stack as an integer-type value. No such specimens are known to exist in the wild, and none of the known NWScript compiler implementations emit code to do so.

NWScript JIT engine: Subroutines, extension calls, and saved state

August 5th, 2010

Previously, I discussed the fundamentals of how the NWScript instruction set operates, particularly with respect to the stack, the primary data store of the NWScript instruction set.

I left several topics out though, including subroutine calls, extension calls to APIs exposed by the script host (action service handler invocations, in NWScript parlance), and saved state (pseudo-coroutines). It’s worth discussing these before continuing further as the nature of how these constructs are implemented shapes the implementation of the JIT environment, in turn.

Subroutine calls

As with most programming languages, NWScript allows the programmer to define subroutines. NWScript’s support for subroutines is most akin to C-style functions (there are no classes in NWScript and thus no concept of member functions).

Unlike C, though, NWScript does not allow the control transfer address in the subroutine call instruction (JSR) to come from a non-constant source. Instead, the subroutine target must be hardcoded into the instruction operand (the instruction stream data space is not accessible to the instruction set itself).

One upshot of this is that, despite the fact that there is no structural listing of all subroutines in a compiled NWScript program, one can still determinstically identify such a listing by walking the instruction stream opcode-by-opcode.

Subroutines in NWScript are all invoked with a uniform calling convention, described as follows:

  1. The caller first allocates stack space for the return values (if any) of the callee. (NWScript subroutines have a single return value from a source level perspective, but a subroutine that returns an aggregate type is implemented by simply returning multiple values.) Return value storage is created on the stack from right to left, and is created typed based on the return types of the subroutine. (Canonically, every stack cell is strongly typed in the NWScript instruction set.)
  2. The caller then pushes the arguments (if any) to the callee in right-to-left order.
  3. The caller executes the subroutine call via a JSR (jump subroutine) instruction. The return program counter is stored in a separate, dedicated stack that is invisible to the NWScript program (the stack pointer is not modified).
  4. The callee completes its processing, and then copies any return values, if any existed, down the stack (to where the caller allocated return value space).
  5. The callee cleans the parameters off the stack, and then executes a RET instruction to return to the caller. If there was a return program counter on the invisible return stack, control is transferred to it, otherwise the script program ends (i.e. return from main).

The main points to keep in mind are that the callee allocates the return value and parameter slot space, and the return values are left on the stack after the call.

The user’s entry point symbol (e.g. main) acts the same as any other subroutine from a calling convention perspective. However, there are some additional considerations when it comes to passing parameters to the entry point symbol, which we’ll discuss later.

Action service handler invocations (extension points)

In order for a script program to be particularly useful, it must have a means to interface with the outside world. This is accomplished via a set of action service handlers that a script host can expose to script programs.

Action service handlers can be logically thought of as an array of function pointers with a well-defined, uniform calling convention that can be invoked by the script VM on behalf of NWScript. As with subroutine calls, parameters and return values of action service handler calls are passed on the VM stack. There is a difference in the VM stack calling convention versus standard subroutine calls, however, in that the action servicer handler is responsible for pushing the return values (if any) on to the VM stack after it cleans the parameters. (Recall that for a standard subroutine call, the callee allocates space for the return value.)

In this respect, one can think of an action service handler call as simply an intrinsic, extended instruction that may be invoked.

To call an action service handler, a NWScript program uses the ACTION instruction, which includes several pieces of information:

  • An action ordinal, indicating which action service handler is to be invoked by the VM.
  • A parameter count, indicating how many (source-level) arguments have been passed to the action service routine. This is the count of arguments that the compiler recognized for the action handler at compile time and may not necessarily reflect the actual count of stack cells (for example, a ‘vector’ argument consumes three stack cells but increments the action parameter count by one).

Because there is no mechanism to perform structured linking of external procedure calls, the action service ordinal embedded into the instruction opcode indicates the extension function to invoke in the script host (similar to how a BIOS might expose functionality to real mode code via a specific interrupt number).

This mechanism provides two mechanisms to allow the set of actions defined by a script host to be expanded without breaking already compiled script programs:

  1. Because actions are identified by ordinal, a script host can define a new action service hander and place it at the end of the action service handler array. Previously compiled script programs will still function, because prior action ordinals remain unchanged.
  2. The ‘parameter count’ operand embedded in the ACTION instruction allows new parameters to be added (defaulted) to the end of the parameter list for a pre-existing action. In this case, by inspecting the parameter count, an action service handler can detect which parameters should be set to a defaulted value (if an old compiled script is being executed), or which are actually on the VM stack (if a new compiled script is being executed).

Additionally, in this design, the script source text compiler and the script VM themselves do not need to have any specific, hardcoded knowledge of the action service handlers. (The source text compiler looks for a list of prototypes in a well-defined header file, and assumes that these prototypes are actual service handler definitions, with the first prototype being action ordinal zero, and so forth.) This means that the script system itself can be structured to be logically isolated from the actual extension API exposed by a script host, making the core script compilation and runtime environment easily portable to new API sets (i.e. new games).

A typical implementation of the script environment would expose an API for action service handlers to push and pop data values from the VM stack, leaving the actual parameter and return value manipulation up to the implementation of a particular action service handler.

For example, an action service handler named ‘AngleToVector’ that might be prototyped like so in NWScript source text:

// Convert fAngle to a vector
vector AngleToVector(float fAngle);

… might typically be implemented like so by a script host:

#define SCRIPT_ACTION( Name )                                  \
	void                                                       \
	NWScriptHost::OnAction_##Name(                             \
	    __in NWScriptVM & ScriptVM,                            \
	    __in NWScriptStack & VMStack,                          \
	    __in NWSCRIPT_ACTION ActionId,                         \
	    __in size_t NumArguments                               \
	    )                                                      
// ...
SCRIPT_ACTION( AngleToVector )
/*++

Routine Description:

	This script action converts an angle to a vector.

Arguments:

	fAngle - Supplies the angle.

Return Value:

	The routine returns the converted vector.

Environment:

	User mode.

--*/
{
	float        fAngle = VMStack.StackPopFloat( ) * (PI / 180.0f);
	NWN::Vector3 Heading;

	Heading.x = cos( fAngle );
	Heading.y = sin( fAngle );
	Heading.z = 0.0f;

	VMStack.StackPushVector( Heading );
}

In a properly structured script execution environment, all of the application-specific extension points, such as the ‘AngleToVector’ action service handler described above, can thus be cleanly separated out from the core execution environment.

Saved states

As I mentioned previously, NWScript is designed to implement an easy-to-program concept for deferred execution. From a source-level, this is implemented by the concept of an ‘action’ argument type that may be passed to an action service handler (this type cannot be used for any purpose except in an action service handler prototype, as far as the source text compiler is concerned).

When the source text compiler encounters an ‘action’-typed value being referenced, it does not generate a typical stack data store instruction. Instead, it emits a special instruction, STORE_STATE, that directs the execution environment to save the current state of the script environment. For a typical VM-based implementation, this can very easily be accomplished by making a copy of the VM stack (which implicitly duplicates the entire data state of the script program).

The STORE_STATE instruction includes a (constant) operand that specifies a program counter to resume execution at. Using a saved VM execution state, the execution of the currently running script can be resumed at the STORE_STATE resume label. Because almost all of the state associated with the logical script invocation is present on the VM stack, resuming execution at a STORE_STATE resume label is very simple to implement for the VM-based execution environment.

From a NWScript source-text compiler perspective, any time that an ‘action’ type is passed to an action service handler, the source compiler allows an arbitrary statement that returns void to be substituted. (Typically, this might be a call to a subroutine that passes local or global variables on the argument list.)

When such a saved state construct is resumed, this arbitrary statement represents the STORE_STATE resume label, allowing a section of the script to be forked off for later execution (in a way that is conceptually similar in many respects to the standard POSIX ‘fork’ system call). The resume label has access to the global and local variables of the script program at the time when the script program’s state was saved, making maintaining state across a deferred execution context easy for the programmer.

From a script execution environment perspective, the execution environment would typically provide a special-purpose API to retrieve a snapshot of the most recently saved state (as created by a STORE_STATE instruction). This API would typically be used by action service handlers that accept an ‘action’-typed arguments.

To see how this paradigm can be used by a script host to provide a very easy to use way to delay execution, consider the following action service handler in NWN1/NWN2:

// Delay aActionToDelay by fSeconds.
void DelayCommand(float fSeconds, action aActionToDelay);

From a source level, it might be common to utilize this action service handler as follows:

DelayCommand( 1.0, DelayedFunctionCall( params ) );

The effect of such a source-level statement is that the ‘DelayedFunctionCall( params )’ section of logic is executed once the DelayCommand time expires (if we are speaking of the NWN1/NWN2 extension API set). Note that the script author is now freed of complex logic to package state up a timer callback or similar construct as other languages might often require; the deferred execution “just works” in a very intuitive manner.

NWScript JIT engine: A (more detailed) overview of NWScript

August 3rd, 2010

Last time, I outlined the basic scope of one of my recent side projects, a JIT engine for NWScript. Although I provided a brief, very high-level overview of the NWScript environment, it’s necessary to understand more details about the inner workings of the instruction set in order for various design decisions in the JIT project to make sense.

Based on the way the NWScript instruction set functions, we can derive the following probable design goals for the instruction set:

  • Extensibility. It should be easy to expose new APIs to the script environment without modifying the compiler or the core script VM logic itself. Furthermore, it should be possible to add new script APIs without breaking previously compiled scripts. Although BioWare’s game engines have evolved significantly since NWScript made its public debut in Neverwinter Nights 1 (“Aurora”), the script instruction set has seen comparatively minor revisions despite a wide range of hosting environments.
  • Portability. The compiled NWScript instruction set should be easily executable on any particular architecture (i.e. there should be no need to write a compiler for each platform the game is ported to, nor should one need to compile again for each platform supported). Neverwinter Nights 1 supported Win32 and Linux, and Dragon Age supports Win32 and Xbox360.
  • Safety (containment). End-users should be free to download custom content with custom script code without fear of the script code taking control over their machine.
  • Suspend / resume (coroutine-like) support. NWScript is designed to easily support timed deferral of various operations at well-defined sequence points in a way that is very natural to program, which is implemented by a coroutine-like saved state mechanism that should be simple to support in the execution environment.

With these design principles in mind, let’s dive in deeper as to how the script instruction set operates. The first major concept that one needs to understand about NWScript is that of the VM stack (or simply stack for short).

Note: For the curious, I have mirrored a copy of Tim Smith’s original NWScript instruction set reference here. There are several errata in Tim’s documentation that have since been identified, but it remains the definitive source on the subject. This posting series generally will not discuss or mention every single instruction in great detail, so the instruction set reference itself is optional reading.

NWScript VM stack

Due to its design nature as an intended interpretive-only scripting environmen, the NWScript instruction set is purely stack-based. That is to say that the primary (and only) data store of a program authored in NWScript is the VM stack, which is an abstract data store that the execution environment provides. This is a common paradigm among other compiled, interpretive languages, designed for ease of use by the interpreter. Remember that NWScript was not originally designed to be executed natively, so any registers that the instruction language provided would most likely be handled by memory indirections anyway.

(To be pedantic, there are three pseudo-registers that a NWScript VM must keep track of — the current program counter (PC), the current top of stack pointer (SP), and the auxiliary stack pointer (BP). These pseudo-register values are generally only implicitly referenced by NWScript instructions in the form of relative address operands.)

The stack is laid out as an expand-up array of uniformly-sized stack cells. Each stack cell holds a single data element, and, as mentioned, all stack cells consume the same amount of space on the logical VM stack (4 addresses per stack cell). In practice, a program is ill-formed if it acesses the stack at a misaligned stack cell offset, so for all intents and purposes the stack cell can be considered the atomic data unit size on the VM stack.

(Why 4 addresses (‘bytes’) per stack cell, you might ask? This is actually an area where NWScript presents a somewhat leaky abstraction; the BioWare NWScript VM internally treats the VM stack as essentially a byte array within which either a raw data element or a pointer to that data element may be stored (the BioWare VM only supports 32-bit targets, making for 4 addresses consumed per stack cell). Nonetheless, all stack references in a program must be aligned to multiples of a size of a stack cell for the program to be considered well-formed.)

Nearly every instruction in the NWScript instruction set that takes a non-constant argument receives that argument from the VM stack (typically the top of the stack). Furthermore, because most instructions only act on the top of the stack, and there is no secondary data store for local variables (as in, say, MSIL), a NWScript program will often spend a non-trivial amount of time moving data around the VM stack.

For example, consider a NWScript subroutine that has two integer local variables, i and j. If this subroutine contains code structured along the lines of i = i + j;, one could expect to see the following general sequence of operations:

  1. Create a copy of j on the top of the stack (via CPTOPSP).
  2. Create a copy of i on the top of the stack (via CPTOPSP).
  3. Add the two variables together (via ADDII). This results in the two integer values at the top of the stack being popped off, and their sum being pushed on to the top of the stack.
  4. Copy the result back down to the permanent copy of i (via CPDOWNSP).
  5. Delete the temporary result of the addition from the top of the stack (via MOVSP).

While this logic is perfectly functional, it leaves plenty of room for improvement by a JIT system (four out of five instructions generated are merely artifacts of the stack-based instruction set versus performing the actual work of the operation).

Extraneous temporary copies could be removed (particularly if the language that the JIT backend translates to supports any sort of optimization of its own), and a number of type and validity checks on each stack reference that the interpretive VM must perform on most instructions could be dispensed with (having been enforced at JIT compilation time).

Data types

There are several fundamental data types recognized by the script instruction set. Each data type is logically stored by value on the stack, and each of these data types consumes a single stack cell:

  • int, the 32-bit signed integer.
  • float, the 32-bit (single precision) IEEE float.
  • object, a 32-bit unsigned quantity which cannot be directly modified by the core instruction set (except the creation of a constant value of type object with two well-known constant values). There are no built-in instructions that can operate on a value of type object (aside from constant creation and raw equality testing on the 32-bit unsigned bit pattern), but an extension API exposed to the script program by the script host may make use of object typed values. Generally, an object-typed value is used as a handle to reference a game entity (such as a door or a creature), but the script instruction set imposes no particular meaning on this type.
  • string, a counted 8-bit character string. Strings are logically stored by value on the stack (and as such, they are essentially immutable, in that any operation that ‘modifies’ a string simply creates a new string with the modified value).
  • vector, a set of three loating point (float) values. A vector is not actually a discrete type, but a rather triplet of three float values allocated adjacent on the stack. Several instructions have forms optimized to operate on this data construct, but a vector is simply otherwise created and manipulated as three float stack cells on the stack. (There is no hard binding of the three floats, rather they are simply three float values created adjacent to one another.)

Additionally, the instruction set allows for a customizable set of up to 10 engine structure types. Engine structure types are opaque to the instruction set itself (and thus, by extension, the source text compiler). They are used to safely package up a data structure maintained by the script host for transmission across a script program authored in NWScript. An engine structure type allows the creation of an empty (or null) value, and supports equality testing by means of a well-defined callout to the script host to compare the equality of two engine structures of the same type.

Additionally, there is an internal data type that may be stored on the stack: stack pointer, which represents a saved value of the current stack pointer. This data type is generally not directly referenced by script code, and is used only during the creation of global variables. (My script VM, NWScriptVM, treats the stack pointer type as a unique type, whereas the BioWare VM technically treats it as an int. A well-formed script program should not read or write a value of this type directly, and it is used only as a side effect by certain instructions.)

Conspicuously missing from the list of data types available for storage is that of an instruction pointer. While NWScript supports the concept of subroutines and subroutine calls, the return instruction pointer is saved in a separate region that is inaccessible to the script program itself. Furthermore, all control transfers must come to either specific extension points, or hardwired subroutine addresses specified as operands to instructions (i.e. there is no concept of a virtual method call or a function pointer in NWScript).

Global variables

If you’ve been following along, you may have noticed a seemingly contradictory set of points. Namely, that I have made reference to the fact that a script program can support global variables, despite that a script is essentially a completely unstructured instruction stream.

In the NWScript instruction set, global variables are referenced relative to an auxiliary stack pointer value (internally known as BP, or base pointer). This auxiliary stack pointer value generally points to the top of a call frame set up by a special compiler-emitted subroutine (internally called #globals) that runs before the main program entry point. The #globals subroutine creates a data elements on its stack frame, and then uses a special-purpose pair of instructions (SAVEBP and RESTOREBP) in order to establish or de-establish a global variable frame, by assigning the current top of stack address to the active base pointer value.

Once #globals has finished preparing a script program’s global variables, it transfers control to the user’s entry point subroutine (i.e. main()).

Aggregates

Although the NWScript source text compiler supports the notion of aggregate types, the instruction set only has a very minimal awareness of aggregates at the instruction level.

All fields of an aggregate type are allocated adjacent on the stack. Operand forms are provided for several instructions to allow them to be repeated over a range of stack locations (to compare structures, for example), and several instructions have special support for performing built-in operations on a the built-in vector structure (a tuple of three floats); for example, there is a derivative of the MUL instruction to support multiplying one float and a series of three other floats in a fashion consistent with vector/scalar multiplication.

Next time, we’ll take a closer look at how subroutine calls, script host extension point (action service handler) calls, and saved states operate.

Erratum for Tim Smith’s NWScript documentation

The following are known errata with respect to Tim Smith’s documentation on the NWScript instruction set:

  • SAVEBP and RESTOREBP are incorrectly documented as not modifying the current value of SP. The current BP value (on entry to SAVEBP) is pushed onto the stack, and similarly restored from the current top of stack by SAVEBP.
  • Several legal opcode + type opcode combinations are omitted from the documentation.

Side projects: NWScript JIT engine (introduction)

August 2nd, 2010

As an interesting experiment in program analysis, one of the side projects that I’ve been working on in my spare time is an execution environment for NWScript, the scripting language for Neverwinter Nights / Neverwinter Nights 2 and the Dragon Age series of games. (I happen to have spent some time analyzing NWN/NWN2 recently, hence this project.)

NWScript 101

From a source-level perspective, NWScript provides a C-like programming environment (there are no native arrays or pointers in the NWN/NWN2 iterations of NWScript). Like C, NWScript source text must be compiled to an object-code form in order for it to be consumed by the execution environment.

As a result, only the source text compiler for NWScript must deal with the program source code directly; the actual execution environment only deals with the NWScript object code. I’ll go into more details as to how the object code is structured in a future posting, but here’s a brief overview necessary to understand the scope of the project:

The NWScript object code format is essentially simply a raw instruction stream (think an old-style .com file), consisting of a series of NWScript instructions. The instruction set (in NWN/NWN2) is a stack-based, strongly-typed, small-complexity set of 45 distinct, variable-length instructions. (Some instructions perform several overloaded functions based on their operand types.) All control transfers in NWScript must be performed to fixed locations hardwired at compile time.

These properties combine to make the NWScript object code format readily analyzable (given sufficient effort), despite the fact that it is essentially provided in a completely unstructured, raw instruction-stream based form.

Project scope

The execution environment that I set out on creating involves several different components:

  • A NWScript VM, capable of interpretive execution of the NWScript object code format. (The BioWare games in question also use an interpretive VM style of execution environment.) While potentially slow(er), the NWScript VM supports even atypical, hand-built NWScript programs (as opposed to those emitted by a compiler).
  • A NWScript analyzer, capable of inspecting a NWScript program in object code form and performing several levels of analysis, up to producing a high-level, full-featured IR (or intermediate representation) describing the NWScript program’s behavior. The IR raising phase implemented by the NWScript analyzer is agnostic to any particular following backend phase, and it provides an opportunity to optimize the generated IR while it is in a backend-neutral form.
  • A NWScript JIT backend, capable of translating the high-level NWScript IR into machine-level instructions for native execution. The JIT backend is intended to be essentially a simple translator to a JIT environment that consumes NWScript IR (generated by the analysis phase), and ultimately emits native code. For this project, the first NWScript JIT backend that I have created emits MSIL that can be compiled to native code by the .NET JIT environment.

Out of scope for this particular posting series is the concept of a NWScript host, which is a program that invokes the NWScript execution environment in order to run a script. The NWScript host provides a series of extension points (action service handlers) that may be called by the script program in order to interface with the outside world. An example of a NWScript host might be the NWN1/NWN2 game servers.

An advanced NWScript host might utilize both the NWScript VM and the NWScript JIT system provided by this project (for example, scripts could be interpretively executed by the script VM until background JIT compilation finishes, after which subsequent execution could fall over to the JIT’d version of the script program).

Next time, I’ll go into some more gory details on the nature of the NWScript object code.

Credits

Tim Smith (Torlack), who now works at BioWare, documented the NWScript instruction set several years ago, and Justin Olbrantz contributed much of the code to the IR-raising portion of this project.

Handy debugger tricks: Setting osloader options on a per-boot basis

July 30th, 2010

Sometimes, you’ll find yourself wishing that you could edit the boot options for a particular Windows instance just for a single boot (perhaps to enable debugging with non-default parameters, above and beyond what the F8 menus allows).

While the F8 boot menu has a lot of options, it’s sometimes the case that you need greater flexibility than what it provides (say, to try and enable USB2 debugging on the fly for a single boot — perhaps you need to use the debugger to rescue a system that won’t boot after a change you’ve made, for instance).

It turns out that there’s actually a (perhaps little-known) way to do this starting with Windows Vista and later; at boot time (whenever you could enter the F8 menu), you can strike F10 and find yourself at a prompt that allows you to directly edit the osloader options for the current boot. (Remember that the settings you enter here aren’t persisted across reboots.)

From here, can enable debugging or change any other osloader setting which you could permanently configure through bcdedit (but only for this boot). This capability is especially helpful if you need to debug setup with non-standard debugger connection setting, which otherwise presents a painful problem.

Remember that osloader options are the old-style options that you used to set in boot.ini (the debugger documentation and MSDN outline these). Don’t use the new-style bcdedit names, as those are only recognized by bcdedit; internally, the options passed on the osloader command line continue to be their old forms (i.e. /DEBUG /DEBUGPORT=1394 /CHANNEL=0).

Debugger tricks: Find all probable CONTEXT records in a crash dump

March 30th, 2009

If you’ve debugged crash dumps for awhile, then you’ve probably ran into a situation where the initial dump context provided by the debugger corresponds to a secondary exception that happened while processing an initial exception that’s likely closer to the original underlying problem in the issue you’re investigating.

This can be annoying, as the “.ecxr” command will point you at the site of the secondary failure exception, and not the original exception context itself. However, in most cases the original, primary exception context is still there on the stack; one just needs to know how to find it.

There are a couple of ways to go about this:

  • For hardware generated exceptions (such as access violations), one can look for ntdll!KiUserExceptionDispatcher on the stack, which takes a PCONTEXT and an PEXCEPTION_RECORD as arguments.
  • For software-generated exceptions (such as C++ exceptions), things get a bit dicier. One can look for ntdll!RtlDispatchException being called on the stack, and from there, grab the PCONTEXT parameter.

This can be a bit tedious if stack unwinding fails, or you’re dealing with one of those dumps where exceptions on multiple threads at the same time, typically due to crash dump writing going haywire (I’m looking at you, Outlook…). It would be nice if the debugger could automate this process a little bit.

Fortunately, it’s actually not hard to do this with a little bit of a brute-force approach. Specifically, just a plain old “dumb” memory scan for something common to most all CONTEXT records. It’s not exactly a finesse approach, but it’s usually a lot faster than manually groveling through the stack, especially if multiple threads or multiple nested exceptions are involved. While there may be false-positives, it’s usually immediately obvious as to what makes sense to be involved with a live exception or not. Sometimes, however, quick-and-dirty brute force type solutions really end up doing the trick, though.

In order to find CONTEXT records based on a memory search, though, we need some common data points that are typically the same for all CONTEXT structures, and, preferably, contiguous (for ease of use with the “s” command, the debugger’s memory search support). Fortunately, it turns out that this exists in the form of the segment registers of a CONTEXT structure:

0:000> dt ntdll!_CONTEXT
+0x000 ContextFlags : Uint4B
[…]

+0x08c SegGs : Uint4B
+0x090 SegFs : Uint4B
+0x094 SegEs : Uint4B
+0x098 SegDs : Uint4B

[…]

Now, it turns out that for all threads in a given process will almost always have the same segment selector values, excluding exotic and highly out of the ordinary cases like VDM processes. (The same goes for the segment selector values on x64 as well.) Four non-zero 32-bit values (actually, 16-bit values with zero padding to 32-bits) are enough to be able to reasonably pull a search off without being buried in false positives. Here’s how to do it with the infamous WinDbg debugger script (also applicable to other DbgEng-enabled programs, such as kd):

.foreach ( CxrPtr { s -[w1]d 0 l?ffffffff @gs @fs @es @ds } ) { .cxr CxrPtr – 8c }

This is a bit of a long-winded command, so let’s break it down into the individual components. First, we have a “.foreach” construct, which according to the debugger documentation, follows this convention:

.foreach [Options] ( Variable { InCommands } ) { OutCommands }

The .foreach command (actually one of the more versitle debugger-scripting commands, once one gets used to using it) basically takes a series of input strings generated by an input command (InCommands) and invokes an command to process that output (OutCommands), with the results of the input command being subsituted in as a macro specified by the Variable argument. It’s ugly and operates based on text parsing (and there’s support for skipping every X inputs, among other things; see the debugger documentation), but it gets the job done.

The next part of this operation is the s command, which instructs the debugger to search for a pattern in-memory in the target. The arguments supplied here instruct the debugger to search only writable memory (w), output only the address of each match (1), scan for DWORD (4-byte) sized quanitites (d) in the lower 4GB of the address space (0 l?ffffffff); in this case, we’re assuming that the target is a 32-bit process (which might be hosted on Wow64, hence 4GB instead of 3GB used). The remainder of the command specifies the search pattern to look for; the segment register values of the current thread. The “s” command sports a plethora of other options (with a rather unwieldy and convoluted syntax, unfortunately); the debugger documentation runs through the gamut of the other capabilities.

The final part of this command string is the output command, which simply directs the debugger to set the current context to the input command output replacement macro’s value at an offset of 0x8c. (If one recalls, 0x8c is the offset from the start of a struct _CONTEXT to the SegGs member, which is the first value we search for; as a result, the addresses returned by the “s” command will be the address of the SegGs member.) Remember that we restricted the output of the “s” command to just being the address itself, which lets us easily pass that on to a different command (which might give one the idea that the “s” and “.foreach” commands were made to work together).

Putting the command string together, it directs the debugger to search for a sequence of four 32-bit values (the gs, fs, es, and ds segment selector values for the current thread) in contiguous memory, and display the containing CONTEXT structure for each match.

You may find some other CONTEXT records aside from exception-related ones while executing this comamnd (in particular, initial thread contexts are common), but the ones related to a fault are usually pretty obvious and self-evident. Of course, this method isn’t foolproof, but it lets the debugger do some of the hard work for you (which beats manually groveling in corrupted stacks across multiple threads just to pull out some CONTEXT records).

Naturally, there are a number of other uses for both the “.foreach” and “s” commands; don’t be afraid to experiment with them. There are other helpers for automating certain tasks (!for_each_frame, !for_each_local, !for_each_module, !for_each_process, and !for_each_thread, to name a few) too, aside from the general “.foreach“. The debugger scripting support might not be the prettiest to look at, but it can be quite handy at speeding up common, repetitive tasks.

One parting tip with “.foreach” (well, actually two parting tips): The variable replacement macro only works if you separate it from other symbols with a space. This can be a problem in some cases (where you need to perform some arithmetic on the resulting expanded macro in particular, such as subtracting 0x8c in this case), however, as the spaces remain when the macro symbol is expanded. Some commands, such as “dt“, don’t follow the standard expression parsing rules (much to my eternal irritation), and choke if they’ve given arguments with spaces.

All is not lost for these commands, however; one way to work around this issue is to store the macro replacement into a pseudo-register (say, “r @$t0 = ReplacementVariableMacro – 0x8c“) and use that pseudo-register in the actual output command, as you can issue multiple, semi-colon delimited commands in the output commands section.

Examining kernel stacks on Vista/Srv08 using kdbgctrl -td, even when you haven’t booted /DEBUG

March 13th, 2009

Starting with Vista/Srv08, local kernel debugging support has been locked down to require that the system was booted /DEBUG before-hand.

For me, this has been the source of great annoyance, as if something strange happens to a Vista/Srv08 box that requires peering into kernel mode (usually to get a stack trace), then one tends to hit a brick wall if the box wasn’t booted with /DEBUG. (The only supported option usually available would be manually bugcheck the box and hope that the crash dump, if the system was configured to write one, has useful data. This is, of course, not a particularly great option; especially as in the default configuration for dumps on Vista/Srv08, it’s likely that user mode memory won’t be captured.)

However, it turns out that there’s limited support in the operating system to capture some kernel mode data for debugging purposes, even if you haven’t booted /DEBUG. Specifically, kdbgctrl.exe (a tool shipping with the WinDbg distribution) supports the notion of capturing a “kernel triage dump”, which basically takes a miniature snapshot of a given process and its active user mode threads. To use this feature, you need to have SeDebugPrivilege (i.e. you need to be an administrative-equivalently-privileged user).

Unlike conventional local KD support, triage dump writing doesn’t give unrestricted access to kernel mode memory on a live system. Instead, it instructs the kernel to create a small dump file that contains a limited, pre-set amount of information (mostly, just the process and associated threads, and if possible, the stacks of said threads). As a result, you can’t use it for general kernel memory examination. However, it’s sometimes enough to do the trick if you just need to capture what the kernel mode side stack trace of a particular thread is.

To use this feature, you can invoke “kdbgctrl.exe -td pid dump-filename“, which takes a snapshot of the process identified by a pid, and writes it out to a dump file named by dump-filename. This support is very tersely documented if you invoke kdbgctrl.exe on the command line with no arguments:

kdbgctrl
Usage: kdbgctrl <options>
Options:
[...]
-td <pid> <file> - Get a kernel triage dump

Now, the kernel’s support for writing out a triage dump isn’t by any means directly comparable to the power afforded by kd.exe -kl. As previously mentioned, the dump is extremely minimalistic, only containing information about a process object and its associated threads and the bare minimums allowing the debugger to function enough to pull this information from the dump. Just about all that you can reasonably expect to do with it is examine the stacks of the process that was captured, via the “!process -1 1f” command. (In addition, some extra information, such as the set of pending APCs attached to each thread, is also saved – enough to allow “!apc thread” to function.) Sometimes, however, it’s just enough to be able to figure out what something is blocking on kernel mode side (especially when troubleshooting deadlocks), and the triage dump functionality can aid with that.

However, there are some limitations with triage dumps, even above the relatively terse amount of data they convey. The triage dump support needs to be careful not to crash the system when writing out the dump, so all of this information is captured within the normal confines of safe kernel mode operation. In particular, this means that the triage dump writing logic doesn’t just blithely walk the thread or process lists like kd -kl would, or perform other potentially unsafe operations.

Instead, the triage dump creation process operates with the cooperation of all threads involved. This is immediately apparent when taking a look at the thread stacks captured in a triage dump:

BugCheck 69696969, {0, 0, 0, 0}

Probably caused by : ntkrnlmp.exe ( nt!IopKernSnapAPCMiniDump+55 )

Followup: MachineOwner
---------

2: kd> k
Call Site
nt!IopKernSnapAPCMiniDump+0x55
nt!IopKernSnapSpecialApc+0x50
nt!KiDeliverApc+0x1e2
nt!KiSwapThread+0x491
nt!KeWaitForSingleObject+0x2da
nt!KiSuspendThread+0x29
nt!KiDeliverApc+0x420
nt!KiSwapThread+0x491
nt!KeWaitForMultipleObjects+0x2d6
nt!ObpWaitForMultipleObjects+0x26e
nt!NtWaitForMultipleObjects+0xe2
nt!KiSystemServiceCopyEnd+0x13
0x7741602a

(Note that the system doesn’t actually bugcheck as a result of creating a triage dump. Kernel dumps implicitly need a bug check parameters record, however, so one is synthesized for the dump file.)

As is immediately obvious from the call stack, thread stack collection in triage dumps operates (in Vista/Srv08) by tripping a kernel APC on the target thread, which then (as it’s in-thread-context) safely collects data about that thread’s stack.

This, of course, presents a limitation: If a thread is sufficiently wedged such that it can’t run kernel APCs, then a triage dump won’t be able to capture full information about that thread. However, for some classes of scenarios where simply capturing a kernel mode stack is sufficient, triage dumps can sometimes fill in the gap in conjuction with user mode debugging, even if the system wasn’t booted with /DEBUG.

Understanding the kernel address space on 32-bit Windows Vista

February 23rd, 2009

[Warning: The dynamic kernel address space is subject to future changes in future releases. In fact, the set of possible address space regions types on public 32-bit Win7 drops is different from the original release of the dynamic kernel address space logic in Windows Vista. You should not make hardcoded assumptions about this logic in production code, as the basic premises outlined in this post are subject to change without warning on future operating system revisions.]

(If you like, you may skip the history lesson and background information and skip to the gory details if you’re already familiar with the subject.)

With 32-bit Windows Vista, significant changes were made to the way the kernel mode address space was laid out. In previous operating system releases, the kernel address space was divvied up into large, relatively fixed-size regions ahead of time. For example, the paged- and non-paged pools resided within fixed virtual address ranges that were, for the most part, calculated at boot time. This meant that the memory manager needed to make a decision up front about how much address space to dedicate to the paged pool versus the non-paged pool, or how much address space to devote to the system cache, and soforth. (The address space is not strictly completely fixed on a 32-bit legacy system. There are several registry options that can be used to manually trade-off address space from one resource to another. However, these settings are only taken into account at memory manager initialization time during system boot.)

As a consequence of these mostly static (after boot time) calculations, the memory manager had limited flexibility to respond to differing memory usage workloads at runtime. In particular, because the memory manager needed to dedicate large address space regions up front to one of several different resource sets (i.e. paged pool, non-paged pool, system cache), situations may arise wherein one of these resources may be heavily utilized to the point of exhaustion of the address range carved out for it (such as the paged pool), but another “peer” resource (such as the non-paged pool) may be experiencing only comparatively light utilization. Because the memory manager has no way to “take back” address space from the non-paged pool, for example, and hand it off to the paged pool if it so turns out allocations from the paged pool (for example) may fail while there’s plenty of comparatively unused address space that has all been blocked off for the exclusive usage of the non-paged pool.

On 64-bit platforms, this is usually not a significant issue, as the amount of kernel address space available is orders of magnitude beyond the total amount of physical memory available in even the largest systems available (for today, anyway). However, consider that on 32-bit systems, the kernel only has 2GB (and sometimes only 1GB, if the system is booted with /3GB) of address space available to play around with. In this case, the fact that the memory manager needs to carve off hundreds of megabytes of address space (out of only 2 or even 1GB total) up-front for pools and the system cache becomes a much more eye-catching issue. In fact, the scenario I described previously is not even that difficult to achieve in practice, as the size of the non-paged or paged pools are typically quite small compared to the amount of physical memory available to even baseline consumer desktop systems nowadays. Throw in a driver or workload that heavily utilizes paged- or non-paged pool for some reason, and this sort of “premature” pool exhaustion may occur much sooner than one might naively expect.

Now, while it’s possible to manually address this issue to a degree on 32-bit system using the aforementioned registry knobs, determining the optimum values for these knobs on a particular workload is not necessarily an easy task. (Most sysadmins I know of tend to have their eyes glaze over when step one includes “Attach a kernel debugger to the computer.”)

To address this growing problem, the memory manager was restructured by the Windows Vista timeframe to no longer treat the kernel address space as a large (or, not so large, depending upon how one wishes to view it) slab to carve up into very large, fixed-size regions at boot time. Hence, the concept of the dynamic kernel address space was born. The basic idea is that instead of carving the relatively meagre kernel address space available on 32-bit systems up into large chunks at boot time, the memory manager “reserves its judgement” until there is need for additional address space for a partial resource (such as the paged- or non-paged pool), and only then hands out another chunk of address space to the component in question (such as the kernel pool allocator).

While this has been mentioned publicly for some time, to the best of my knowledge, nobody had really sat down and talked about how this feature really worked under the hood. This is relevant for a couple of reasons:

  1. The !address debugger extension doesn’t understand the dynamic kernel address space right now. This means that you can’t easily figure out where an address came from in the debugger on 32-bit Windows Vista systems (or later systems that might use a similar form of dynamic kernel address space).
  2. Understanding the basic concepts of how the feature works provides some insight into what it can (and cannot do) to help alleviate the kernel address space crunch.
  3. While the kernel address space layout has more or less been a relatively well understood concept for 32-bit systems to date, much of that knowledge doesn’t really directly translate to Windows Vista (and potentially later) systems.


Please note that future implementations may not necessarily function the same under the hood with respect to how the kernel address space operates.

Internally, the way the memory manager provides address space to resources such as the non-paged or paged- pools has been restructured such that each of the large address space resources act as clients to a new kernel virtual address space allocator. Wherein these address space resources previously received their address ranges in the form of large, pre-calculated regions at boot time, instead, each now calls upon the memory manager’s internal kernel address space allocator (MiObtainSystemVa) to request additional address space.

The kernel address space allocator can be thought of as maintaining a logical pool of unused virtual address space regions. It’s important to note that the address space allocator doesn’t actually allocate any backing store for the returned address space, nor any other management structures (such as PTEs); it simply reserves a chunk of address space exclusively for the use of the caller. This architecture is required due to the fact that everything from driver image mapping to the paged- and non-paged pool backends have been converted to use the kernel address space allocator routines. Each of these components has very different views of what they’ll actually want to use the address space for, but they all commonly need an address space to work in.

Similarly, if a component has obtained an address space region that it no longer requires, then it may return it to the memory manager with a call to the internal kernel address space allocator routine MiReturnSystemVa.

To place things into perspective, this system is conceptually analogous to reserving a large address space region using VirtualAlloc with MEM_RESERVE in user mode. A MEM_RESERVE reservation doesn’t commit any backing store that allows data to reside at a particular place in the address space, but it does grant exclusive usage of an address range of a particular size, which can then be used in conjunction with whatever backing store the caller requires. Likewise, it is similarly up to the caller to decide how they wish to back the address space returned by MiObtainSystemVa.

The address space chunks dealt with by the kernel address space allocator, similarly to the user mode address space reservation system, don’t necessarily need to be highly granular. Instead, a granularity that is convenient for the memory management is chosen. This is because the clients of the kernel address space allocator will then subdivide the address ranges they receive for their own usage. (The exact granularity of a kernel address space allocation is, of course, subject to change in future releases based upon the whims of what is preferable for the memory manager)

For example, if a driver requests an allocation from the non-paged pool, and there isn’t enough address space assigned to the non-paged pool to handle the request, then the non-paged pool allocator backend will call MiObtainSystemVa to retrieve additional address space. If successful, it will conceptually add this address space to its free address space list, and then return a subdivided chunk of this address space (mated with physical memory backing store, as in this example, we are speaking of the non-paged pool) to the driver. The next request for memory from the non-paged pool might then come from the same address range previously obtained by the non-paged pool allocator backend. This behavior is, again, conceptually similar to how the user mode heap obtains large amounts of address space from the memory manager and then subdivides these address regions into smaller allocations for a caller of, say, operator new.

All of this happens transparently to the existing public clients of the memory manager. For example, drivers don’t observe any particularly different behavior from ExAllocatePoolWithTag. However, because the address space isn’t pre-carved into large regions ahead of time, the memory manager no longer has its hands proverbially tied with respect to allowing one particular consumer of address space to obtain comparatively much more address space than usual (of course, at a cost to the available address space to other components). In effect, the memory manager is now much more able to self-tune for a wide variety of workloads without the user of the system needing to manually discover how much address space would be better allocated to the paged pool versus the system cache, for example.

In addition, the dynamic kernel address space infrastructure has had other benefits as well. As the address spans for the various major consumers of kernel address space are now demand-allocated, so too are PTE and other paging-related structures related to those address spans. This translates to reduced boot-time memory usage. Prior to the dynamic kernel address space’s introduction, the kernel would reduce the size of the various mostly-static address regions based on the amount of physical memory on the system for small systems. However, for large systems, and especially large 64-bit systems, paging-related structures potentially describing large address space regions were pre-allocated at boot time.

On small 64-bit systems, the demand-allocation of paging related structures also removes address space limits on the sizes of individual large address space regions that were previously present to avoid having to pre-allocate vast ranges of paging-related structures.

Presently, on 64-bit systems, many, though not all of the major kernel address space consumers still have their own dedicated address regions assigned internally by the kernel address space allocator, although this could easily change as need be in a future release. However, demand-creation of paging-describing structures is still realized (with the reduction in boot-time memory requirements as described above,

I mentioned earlier that the !address extension doesn’t understand the dynamic kernel address space as of the time of this writing. If you need to determine where a particular address came from while debugging on a 32-bit system that features a dynamic kernel address space, however, you can manually do this by looking into the memory manager’s internal tracking structures to discern for which reason a particular chunk of address space was checked out. (This isn’t quite as granular as !address as, for example, kernel stacks don’t have their own dedicated address range (in Windows Vista). However, it will at least allow you to quickly tell if a particular piece of address space is part of the paged- or non-paged pool, whether it’s part of a driver image section, and soforth.)

In the debugger, you can issue the following (admittedly long and ugly) command to ask the memory manager what type of address a particular virtual address region is. Replace <<ADDRESS>> with the kernel address that you wish to inquire about:

?? (nt!_MI_SYSTEM_VA_TYPE) ( ((unsigned char *)(@@masm(nt!MiSystemVaType)))[ @@masm( ( <<ADDRESS>> - poi(nt!MmSystemRangeStart)) / (@$pagesize *

@$pagesize / @@c++(sizeof(nt!_MMPTE))) ) ] )

Here’s a couple of examples:


kd> ?? (nt!_MI_SYSTEM_VA_TYPE) ( ((unsigned char *)(@@masm(nt!MiSystemVaType)))
[ @@masm( ( 89445008 - poi(nt!MmSystemRangeStart)) / (@$pagesize * @$pagesize / @@c++(sizeof(nt!_MMPTE))) ) ] )
_MI_SYSTEM_VA_TYPE MiVaNonPagedPool (5)

kd> ?? (nt!_MI_SYSTEM_VA_TYPE) ( ((unsigned char *)(@@masm(nt!MiSystemVaType)))
[ @@masm( ( ndis - poi(nt!MmSystemRangeStart)) / (@$pagesize * @$pagesize / @@c++(sizeof(nt!_MMPTE))) ) ] )
_MI_SYSTEM_VA_TYPE MiVaDriverImages (12)

kd> ?? (nt!_MI_SYSTEM_VA_TYPE) ( ((unsigned char *)(@@masm(nt!MiSystemVaType)))
[ @@masm( ( win32k - poi(nt!MmSystemRangeStart)) / (@$pagesize * @$pagesize / @@c++(sizeof(nt!_MMPTE))) ) ] )
_MI_SYSTEM_VA_TYPE MiVaSessionGlobalSpace (11)

The above technique will not work on 64-bit systems that utilize a dynamic (demand-allocated) kernel address space, as the address space tracking is performed differently internally.

(Many thanks to Andrew Rogers and Landy Wang, who were gracious enough to spend some time divulging insights on the subject.)