Archive for the ‘Programming’ Category

NWScript JIT engine: IR instruction raising

Saturday, August 14th, 2010

Previously, we learned about the high-level structure of the NWScript IR. This time, we’ll take a closer look at the IR raising phase.

The IR raising process is the domain of the NWScript analyzer component. The IR raising sequence has several distinct phases, each of which makes one pass through the logical structure of a script program:

  1. A structural analysis phase, which discovers all subroutines in the program and their parameter and return value sizes (in stack cells). The structural analysis phase also produces control flow objects as branches and branch targets are encountered. Any STORE_STATE resume labels are converted to first-class subroutines during structural analysis as well (with locals inherited from the parent subroutine promoted to parameters). We covered most of the structural analysis phase in the last post.
  2. A code analysis phase. The code analysis phase performs the bulk of the remaining IR raising work, including creating variable objects for each distinct stack location (at a given control flow point), discovering and propagating variable types, discovering and creating global variables, and generating the actual IR.
  3. A post-processing phase. Unlike the other two phases, the post-processing phase operates primarily on the IR and not the NWScript instruction stream. The post-processing phase performs basic optimization on the emitted IR, such as folding of temporary variables generated by NWScript stack operations into their parent variables. Additionally, it generates various hints for the JIT backend, such as whether an IR variable is written to exactly once, or whether an IR variable is written to but never read from.

At this point in the project, we executed a divide and conquer strategy; Justin wrote the code analysis and post-processing phases of the NWScript analyzer (and restructured much of what I had written for the structural analysis phase) once we had come up with a design, while I went on to implement the MSIL (.NET) JIT backend that consumed the IR.

I’ll let Justin delve into the full details of the code analysis and post-processing phases and just provide a quick summary here.

Variable stack

The code analysis phase traverses the NWScript instruction stream while maintaining a logical variable stack that maps NWScript stack locations at the current program counter to IR-level variables. The variable stack facilitates “de-stacking” of NWScript stack variables, plus lifetime management for these variables (once a variable is popped off of the variable stack, the code analyzer knows that its logical lifetime has ended, thus allowing this information to be conveyed in the form of an emitted I_DELETE IR instruction).

Parameter and return value variables can simply be placed on the variable stack (and appropriately flagged) when analysis begins at a particular subroutine; because the structural analysis phase has discovered the number of individual stack cells devoted to parameters and return values for each subroutine, tracking of these constructs within the code analyzer simply boils down to creating untyped variables on the variable stack before the NWScript instructions in a particular subroutine are analyzed.
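
To make this concrete, here is a minimal sketch of what such a variable stack might look like. The type and member names below are purely illustrative (the real analyzer tracks considerably more per-variable state), so treat it as a conceptual model rather than actual analyzer code:

// Hypothetical sketch of a code-analysis variable stack; the real analyzer's
// bookkeeping (flags, types, control flow merge handling) is considerably richer.

#include <cassert>
#include <cstddef>
#include <memory>
#include <vector>

struct IRVariable
{
    enum VarClass { Local, Global, Parameter, ReturnValue };

    VarClass Class;
    int      Type;           // -1 (indeterminate) until a typed stack reference is seen.
    bool     LifetimeEnded;  // Set when the cell is popped; maps to I_DELETE for locals.
};

class VariableStack
{
public:
    // A new stack cell appears at the current PC; create an IR variable for it.
    IRVariable * Push( IRVariable::VarClass Class )
    {
        m_Pool.push_back( std::make_unique< IRVariable >( IRVariable{ Class, -1, false } ) );
        m_Stack.push_back( m_Pool.back( ).get( ) );
        return m_Stack.back( );
    }

    // A stack cell is removed; the variable's logical lifetime has ended.
    IRVariable * Pop( )
    {
        assert( !m_Stack.empty( ) );
        IRVariable * Var = m_Stack.back( );
        m_Stack.pop_back( );
        Var->LifetimeEnded = true;
        return Var;
    }

    // Resolve a CPTOPSP/CPDOWNSP-style reference N cells down from the top of
    // the stack to the IR variable currently occupying that cell.
    IRVariable * AtDepth( std::size_t CellsFromTop ) const
    {
        assert( CellsFromTop < m_Stack.size( ) );
        return m_Stack[ m_Stack.size( ) - 1 - CellsFromTop ];
    }

private:
    std::vector< std::unique_ptr< IRVariable > > m_Pool;   // Owns every variable ever created.
    std::vector< IRVariable * >                  m_Stack;  // Variables currently live on the VM stack.
};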

Type-equivalence chains

The code analysis phase discovers variable types by linking variables that can be inferred to have a similar type together into a type equivalence chain. (Type equivalence chain links may be formed by operations such as a subroutine call, or an assignment, where both sides of the operation are known to yield the same type, though that type might not be immediately known.)

Initially, the type equivalence chain may have an indeterminate type, but once a typed stack reference is made to any variable in a chain, the types of all associated variables become known. (Only variables that are copied on the NWScript stack but never referenced for a specific operation can remain untyped after code analysis completes when using this system. Since these variables are, by nature, extraneous, leaving them untyped does not present a problem for program correctness.)

The type-equivalence chain system can be used for strong type discovery because the NWScript stack itself is strongly typed; any time an operation will actually be performed on a variable, its type is specified in the instruction operands embedded in the instruction stream. Thus, once the type of any variable in a type equivalence chain has been uncovered, the types of all other variables are instantly known. (One exception to this is struct/struct comparisons, which are simply a comparison over a range of stack space without a specific type provided. Other references to the variables involved will always have resolved their types, unless one of the structs was passed as a parameter to the entry point symbol; entry point symbols do not support struct parameters, though, precluding that from becoming a problem.)
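
One simple way to picture such a chain is a union-find structure in which observing a concrete type for any member types the whole set. The sketch below is illustrative only; the actual analyzer’s chain representation (and its set of types) differs:

// Hypothetical union-find sketch of type-equivalence chains; assigning a
// concrete type to any member of a chain types the entire chain at once.

#include <cassert>
#include <vector>

enum class VarType { Indeterminate, Int, Float, String, Object, EngineStruct };

class TypeChains
{
public:
    int NewVariable( )
    {
        m_Parent.push_back( (int)m_Parent.size( ) );
        m_Type.push_back( VarType::Indeterminate );
        return (int)m_Parent.size( ) - 1;
    }

    // Link two variables known to share a type (e.g. both sides of an
    // assignment, or a call argument and the matching parameter).
    void Link( int A, int B )
    {
        int Ra = Find( A ), Rb = Find( B );
        if (Ra == Rb)
            return;
        if (m_Type[ Rb ] == VarType::Indeterminate)      // Keep whichever concrete type
            m_Type[ Rb ] = m_Type[ Ra ];                 // has already been discovered.
        assert( m_Type[ Ra ] == VarType::Indeterminate ||
                m_Type[ Ra ] == m_Type[ Rb ] );          // A conflict means an ill-formed program.
        m_Parent[ Ra ] = Rb;
    }

    // A typed stack reference (e.g. an ADDII operand) pins the chain's type.
    void Observe( int Var, VarType Type )
    {
        int R = Find( Var );
        assert( m_Type[ R ] == VarType::Indeterminate || m_Type[ R ] == Type );
        m_Type[ R ] = Type;
    }

    VarType TypeOf( int Var ) { return m_Type[ Find( Var ) ]; }

private:
    int Find( int V )
    {
        while (m_Parent[ V ] != V)
            V = m_Parent[ V ] = m_Parent[ m_Parent[ V ] ];   // Path halving.
        return V;
    }

    std::vector< int >     m_Parent;
    std::vector< VarType > m_Type;
};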

IR instruction generation

IR instruction generation in the code analysis phase operates by decoding the current NWScript instruction, picking up the actual IR-level variable operands from the variable stack, and creating an IR-level instruction object linking the variables together with the IR instruction descriptor.

Some NWScript instructions may turn into multiple IR instructions (such as a struct/struct NWScript-level OP_EQUAL/OP_NEQUAL comparison instruction, which must produce I_LOGOR/I_LOGAND instructions to combine the result of each successive I_EQUAL/I_NEQUAL comparison in order to form the final result).

Post-processing phase

Similarly, the post-processing phase walks the generated, now functional IR and attempts to fold extraneous temporaries into their parent variables (among other optimization-related tasks). Temporary folding is useful because the raw translation of NWScript instructions to IR-level instructions yields numerous temporaries (reflections of NWScript instructions copying data to the top of the stack for a particular operation); these temporaries can be proven to be removable in many cases. While the JIT backend generally emits code to a system with its own embedded optimizer, eliminating these temporaries reduces the size of the IR (and the output of a non-optimizing JIT backend).
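
Incidentally, the backend hints mentioned earlier (a variable written exactly once, or written but never read) fall naturally out of a simple counting pass over the finished IR. Here is a hedged, self-contained sketch; the instruction and variable shapes are hypothetical stand-ins for the real IR objects:

// Illustrative counting pass for backend hints such as "written exactly once"
// and "written but never read"; the IR shapes here are simplified stand-ins.

#include <cstddef>
#include <vector>

struct IRInstrUse
{
    std::vector< std::size_t > Reads;    // Indices of variables read by this instruction.
    std::vector< std::size_t > Writes;   // Indices of variables written by this instruction.
};

struct VariableHints
{
    unsigned Reads  = 0;
    unsigned Writes = 0;

    bool WrittenExactlyOnce( ) const  { return Writes == 1; }
    bool WrittenButNeverRead( ) const { return Writes != 0 && Reads == 0; }
};

std::vector< VariableHints > ComputeHints( const std::vector< IRInstrUse > & Code, std::size_t NumVars )
{
    std::vector< VariableHints > Hints( NumVars );

    for (const IRInstrUse & Instr : Code)
    {
        for (std::size_t Var : Instr.Reads)
            Hints[ Var ].Reads += 1;
        for (std::size_t Var : Instr.Writes)
            Hints[ Var ].Writes += 1;
    }

    return Hints;
}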

It’s important to note that this overview is a simplification; there are some tricky cases that need special handling, such as merging two control flows (and their associated variables) together.

Once the IR is produced, the JIT backend takes over. The JIT backend is handed a completed IR for the program to JIT; it then assumes responsibility for lowering the NWScript IR into the native form emitted by the JIT backend.

Coming up: Examining the MSIL (.NET) JIT backend in more detail.

NWScript JIT engine: Intermediate Representation (IR) design

Wednesday, August 11th, 2010

Last time, I discussed how parameterized scripts turn out to actually work in NWScript, plus I delved into some details as to how the NWScript analyzer performs its first analysis pass, the structural analysis phase.

Although the NWScript VM need not go further than the structural analysis phase in order to correctly execute parameterized scripts[1], if we wish to create a JIT system for NWScript that works with any degree of efficiency, it’s necessary to perform some more advanced analysis. This additional analysis is needed in order to “flatten” the script into a set of distinct variables that are manipulated by a fundamental instruction set.

The result of the structural analysis phase and the additional analysis phase (which I term the code analysis phase) is the NWScript IR, or intermediate representation. The intermediate representation (IR for short) is a set of data structures that describe the entire functional behavior of a script program in a format that is easy to translate into other forms.

Because the JIT system is designed to be easily extensible to multiple backends, the NWScript analyzer’s analysis phases do not directly emit any machine instructions; it is the job of the JIT backend to translate the resultant IR into a JIT-specific program representation. For the purposes of this post, we’ll focus on the IR itself.

The NWScript IR is designed with several principles in mind:

  • The IR should be designed in such a way that it’s easy to consume from a variety of code generation backends.
  • Each IR instruction should (generally) be independent of any previous instructions. That is, one should not need to track an inordinate amount of state specific to simply processing the IR in order to process each IR instruction. (As you’ll find out, this guideline is partially violated on two occasions for simplicity’s sake, but it generally holds true.) This implies that a non-optimizing backend should generally be able to consume the IR in more or less a single pass, without requiring significant postprocessing work.

With these loose characteristics in mind, let’s look at the basic layout of the IR. The IR consists of a set of in-memory C++ data structures (currently, no serialization format is designed; the backend must be able to directly interface with C++ data structures). These data structures encapsulate the following core concepts:

  • A NWScriptVariable, describing a strongly typed symbolic location in which data can be retained. Unlike the NWScript instruction set’s stack-based system, variables can be referenced freely. Variables can be marked as constant, and can fall under various classifications (locals, globals, parameters, return values, call site (outgoing) parameters, call site (outgoing) return values, etc.) Locals have well-defined lifetimes that are bounded by a series of specific instructions, whereas other variable classes have other, generally obvious lifetime semantics.
  • A NWScriptInstruction, which contains an underlying IR opcode, plus data and variable linkages that form the operands of the instruction. Not all IR instructions are direct analogs of NWScript instructions; some instructions have multiple mappings, or no direct mappings at all.
  • A NWScriptControlFlow. The control flow object represents a contiguous span of IR instructions that are neither the target of a branch, nor the source of a branch. From the IR perspective, a control flow contains a list of IR instructions, a description of how the control flow terminates (i.e. fallthrough, unconditional branch, conditional branch, end of function, merge, etc.), and a list of links to up to two successor control flows. (The successor links point to both forks of a conditional control flow transfer, or one of the links points to the fall through / unconditional endpoint of an unconditional control flow transfer.)
  • A NWScriptSubroutine, representing an IR subroutine. An IR subroutine can represent either a full-fledged NWScript subroutine, or the resume label of a STORE_STATE instruction. In either case, a subroutine groups together a list of control flows, strongly typed parameter variables, strongly typed return types, and several other pieces of information (such as a name, if the name of the subroutine could be learned from a debug symbol file for the script being analyzed; locating symbols is not required).

There are several other minor data structures which I will not describe here, as they are not particularly important for understanding the design of the IR. Additionally, various other attributes are associated with these above data items (such as a list of parent links for the control flow object), but these are similarly not required for a basic understanding.
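
As a rough mental model, the concepts above could be skeletonized as follows. This is only an orientation sketch with shortened, illustrative names; the real NWScriptVariable, NWScriptInstruction, NWScriptControlFlow, and NWScriptSubroutine classes carry many more attributes, as noted:

// Simplified sketch of the IR object model described above; field names are
// illustrative and do not mirror the real classes exactly.

#include <memory>
#include <string>
#include <vector>

struct NWVariable
{
    enum VarClass { Local, Global, Parameter, ReturnValue, CallParameter, CallReturnValue };

    VarClass Class;
    int      Type;        // Strongly typed once analysis completes.
    bool     IsConstant;
};

struct NWInstruction
{
    int                          Opcode;    // One of the I_* IR opcodes listed below.
    std::vector< NWVariable * >  Operands;  // Variable linkages forming the instruction's operands.
    unsigned long                Extra;     // Opcode-specific data (e.g. an action ordinal).
};

struct NWControlFlow
{
    enum Terminator { Fallthrough, UnconditionalBranch, ConditionalBranch, EndOfFunction, Merge };

    std::vector< NWInstruction > Instructions;
    Terminator                   EndsWith;
    NWControlFlow *              Successors[ 2 ];  // Both forks of a conditional, or one endpoint.
};

struct NWSubroutine
{
    std::string                                      Name;         // If symbols happened to be available.
    std::vector< std::unique_ptr< NWControlFlow > >  ControlFlows;
    std::vector< NWVariable * >                      Parameters;   // Strongly typed parameter variables.
    std::vector< int >                               ReturnTypes;  // Strongly typed return cells.
};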

The IR instruction set itself consists of the following operations:

typedef enum _INSTR
{
  I_CREATE,       // Create variable with type (no value)
  I_DELETE,       // Delete variable
  I_ASSIGN,       // Copy (assign) variables
  I_JZ,           // Jump if zero
  I_JNZ,          // Jump if not zero
  I_JMP,          // Jump unconditionally [deprecated]
  I_CALL,         // Call subroutine
  I_RETN,         // Return from subroutine
  I_ACTION,       // Call script action
  I_SAVE_STATE,   // Save state for script situation
  I_LOGAND,       // Logical AND (&&)
  I_LOGOR,        // Logical OR (||)
  I_INCOR,        // Bitwise OR (|)
  I_EXCOR,        // Bitwise XOR (^)
  I_BOOLAND,      // Bitwise AND (&)
  I_EQUAL,        // Compare (==)
  I_NEQUAL,       // Compare (!=)
  I_GEQ,          // Compare (>=)
  I_GT,           // Compare (>)
  I_LT,           // Compare (<)
  I_LEQ,          // Compare (<=)
  I_SHLEFT,       // Shift left (<<)
  I_SHRIGHT,      // Shift signed right (>>, SAR)
  I_USHRIGHT,     // Shift unsigned right (>>)
  I_ADD,          // Add (+), concatenate strings
  I_SUB,          // Subtract (-)
  I_MUL,          // Multiply (*)
  I_DIV,          // Divide (/)
  I_MOD,          // Modulus (%)
  I_NEG,          // Negation (-)
  I_COMP,         // Complement (~)
  I_NOT,          // Logical NOT (!)
  I_INC,          // Increment
  I_DEC,          // Decrement
  I_TEST,         // Set zero/not zero based on variable value
  I_INITIALIZE,   // Set variable to default value

  LASTINSTR
} INSTR, * PINSTR;

typedef const enum _INSTR * PCINSTR;

If you’ve taken a look at Tim Smith’s NWScript instruction set reference, many of the instructions should look familiar to you, in the form of a condensed set of the fundamental NWScript instruction opcodes. However, there are several new instructions, and as I had mentioned previously, you may notice that some NWScript instruction opcodes have no direct equivalents (such as the purely stack manipulation instructions).

Most instructions are relatively straightforward to understand from the comments above, so I will not describe each and every one of them; additionally, for most (data-processing) instructions, a uniform API is provided to retrieve the input/output variable objects affected, making code generation simple.

Several instructions are, however, more structural than data-processing in nature and bear further explanation; these are covered individually below.

The I_CREATE and I_DELETE instructions bound the lifetime of a local variable. When an I_CREATE is issued, a variable is activated for use until an I_DELETE is encountered. These are provided as hints to the backend as to when a variable needs to remain active. These instructions are loose exceptions to the general design guideline that an instruction should not involve previous state, and can be thought of as optimization hints (versus creating all local variables up front from a code generation perspective).

I_SAVE_STATE represents the operation of the STORE_STATE and STORE_STATEALL NWScript instructions; it supplies a list of local and global variables used by the resume point of a STORE_STATE instruction, as well as a subroutine descriptor describing the resume point’s IR-level instruction code. (The analyzer raises a STORE_STATE resume point into a first class subroutine, with all local variables inherited from the ‘parent’ subroutine having been turned into parameters.) This allows the backend to easily generate code to save the required locals away at runtime for the next operation that consumes a saved state.

I_TEST is used by the conditional jump instructions. As a special exception to the rule that one instruction does not depend on the state of another, the I_TEST instruction must be immediately followed by a conditional jump (I_JZ or I_JNZ) instruction. Its purpose is to take an integer-typed variable operand and make it into a Boolean / logical 1 or zero value for use in the immediately following conditional jump. The purpose of a separate I_TEST instruction is to prevent temporary variables generated by a conditional operation from having to always be bifurcated across both control flow forks in the IR.
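
To illustrate just how little carried state the I_TEST pairing rule actually requires, here is a small, self-contained sketch of a backend-style pass over one control flow. The opcode and structure names below are stand-ins for the real IR types, and the pass merely collects branch decisions rather than emitting any code:

// Hypothetical illustration of the I_TEST pairing rule: the only state carried
// between instructions is the most recently tested variable, consumed by the
// immediately following conditional jump. All names here are illustrative.

#include <cstddef>
#include <vector>

enum IrOpcode { IR_TEST, IR_JZ, IR_JNZ, IR_OTHER };

struct IrInstr
{
    IrOpcode    Opcode;
    std::size_t Var;        // For IR_TEST: the integer variable being tested.
};

struct BranchDecision
{
    std::size_t Var;        // Variable whose zero/nonzero value selects the fork.
    bool        JumpIfNotZero;
};

std::vector< BranchDecision > CollectBranches( const std::vector< IrInstr > & Flow )
{
    std::vector< BranchDecision > Decisions;
    std::size_t PendingTest = 0;
    bool        HavePending = false;

    for (const IrInstr & Instr : Flow)
    {
        switch (Instr.Opcode)
        {
        case IR_TEST:
            PendingTest = Instr.Var;   // Remember the condition for the next jump.
            HavePending = true;
            break;

        case IR_JZ:
        case IR_JNZ:
            if (HavePending)           // Guaranteed by the IR's pairing rule.
                Decisions.push_back( BranchDecision{ PendingTest, Instr.Opcode == IR_JNZ } );
            HavePending = false;
            break;

        default:
            break;                     // Other opcodes need no carried state.
        }
    }

    return Decisions;
}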

I_INITIALIZE supplies a hard initialization sequence point at which a variable must be assigned its type-specific default value. This IR instruction fulfills two purposes: it allows the backend to avoid emitting non-useful reinitialization code if the backend’s representation of variables is reused in a variable pool of some sort, and it allows the JIT backend to circumvent an issue with the IR raising phase relating to merging of variables across flows in some circumstances. (It is probable that the I_INITIALIZE instruction will go away and turn into simply a flag on a variable object once the merging problem I alluded to is no longer an issue. Justin tells me that he’s planning on writing a post describing the variable merging issue I mentioned, so I’ll let him provide the details there.)

Fear not, there is yet even more NWScript discussion upcoming; stay tuned…

[1] Actually, to be pedantic, there was a rather ugly hack I had inserted in the VM that manages to execute parameterized scripts with wrong parameter counts so long as they do not return a value. In the case of NWN2, this is a particularly important (and the most likely to be hit) case. However, this hack is obsolete with the structural analysis phase providing data on the number of parameters that the entry point symbol takes, so I will not detail the hack here.

NWScript JIT engine: Parameterized script entry point invocation and subroutine structural analysis

Monday, August 9th, 2010

Previously, I discussed how subroutine calls and extension (action service handler) calls are made in the NWScript instruction set. This time, we’ll dive in deeper to a feature that NWN2 added to NWScript: Parameterized entry point invocation, or the capability to pass parameters to a script’s main subroutine.

Originally, in the NWN1 days, NWScript did not support providing arguments to a script’s entry point symbol. Instead, when data needed to be made available to a script (say, a script that would be fired when a particular event occurred), a special purpose action service handler would be created that would return the parameter to the script program. For example, a script might be called when a player attempts to open a door, and the script might need to deal with both the player object and the door being opened. To handle this case, the designers of NWN1 would create special action service handlers that would return these sorts of event parameters (see an example).

This approach has some advantages and disadvantages. On the upside, it’s very easy to expose new parameter information to scripts called for specific events. On the downside, however, this model is not very intuitive to script authors; the result is that NWN1 (and NWN2 in many cases due to backwards compatibility with NWN1) has a number of very specialized ‘get event information’ action service handlers that can be called meaningfully only at specific points in time. Figuring out which event handler is associated with which dedicated action service handler has been the source of innumerable bugs in NWN scripts.

In NWN2, the game designers added the capability to pass parameters directly to the entry point of a NWScript program, like any other NWScript subroutine, as a mechanism for passing information to script events. However, they had several constraints in doing so:

  • It should be easy to add new parameters without breaking already compiled scripts.
  • Significant changes to the compiler and execution environment should be avoided to the degree possible.

To that end, the NWN2 designers came up with a solution that required no changes to the NWScript compiler toolchain: They kept the existing calling convention for script subroutines and simply used it for the entry point symbol.

While this is an easy solution from the compiler point of view, it creates some unfortunate complications for the execution environment. Specifically, the entry point symbol is not actually the first bit of code executed by a NWScript program. Instead, there are two compiler-emitted stubs that are produced before the entry point:

  1. First, a #loader routine runs. The #loader routine’s purpose is to allocate storage for the entry point symbol’s return value, if the entry point has one, and then call it. (Remember that the caller must allocate storage for the return value of a script subroutine.)
  2. Second, if the script uses global variables, a #globals routine is called by #loader instead of the entry point. The #globals routine establishes a local variable frame containing all global variables in the script, then assigns the auxiliary SP pointer (BP) to its current SP value, before finally invoking the real entry point symbol.

If you’re paying attention, you may begin to see the problem that will arise here. Because the entry point symbol isn’t the first symbol called when the script is invoked, the execution environment can’t simply put the parameters on the stack before calling into the script the first time.

Why is this the case, you might ask? There are several issues here:

First, if the script returns a value, the return value will be pushed by #loader before the user’s entry point is called. This means that the return value of the entry point would appear to be a parameter to the entry point (recall that the return value is the last item on the stack when calling a subroutine, and it is allocated by the caller of a subroutine).

Secondly, if the script program used global variables, were the execution environment to push parameters before first entering the script, the parameters would take the place of several global variables, and some of the global variables would turn into parameters to the script entry point.

In either case, we clearly have what might technically be described as a train wreck; the script program would behave wildly incorrectly if #loader and #globals were not modified to be aware of parameters passed to the entry point symbol.

While this could have been done, the designers of NWN2 elected not to do so. (Even if these two compiler-generated stubs were modified, the calling convention for a particular (event handler) script’s entry point would become fixed for all time — it would not be possible to add another parameter to the entry point symbol without breaking already compiled scripts.)

Instead, a series of modifications were made to the execution environment itself to cope with the fact that the compiler toolchain wasn’t updated, and the fact that the script might be designed to be called with a different count of parameters.

Specifically, when a parameterized script is going to be invoked, the execution environment must perform the following steps:

  1. The actual number of parameters that the entry point symbol takes is determined, and the number and types of parameters passed to the script are adjusted accordingly. In the case of the stock NWN2 implementation, this is accomplished by requiring the source text for a parameterized script to be available; the source text compiler is then used to discover the parameter counts and types to the entry point symbol.
  2. The execution environment arranges for parameters to be pushed after the entry point symbol is first branched to by watching for either #loader or #globals (which have several uniquely recognizable characteristics) to complete.

The first requirement, in particular, is a highly undesirable and onerous one (a compiled instruction set language should not require the execution environment to possess the source text).

For my implementation of a NWScript VM (and later, the JIT engine), I solved the first problem with a different approach (the second problem continues to be addressed by watching for #loader or #globals to finish execution within my VM):

Instead of relying on embedding a source compiler in my VM, I tagged the parameters to the entry point symbol as having an indeterminate type on the VM stack. NWN2 defines conversion rules from strings to script parameters, as in some cases parameterized scripts are invoked from a list of strings. Finally, once an indeterminate-typed parameter is referenced on the stack, it is simply changed into the type that it was referenced as (since all stack references must be strongly typed, there is no ambiguity other than for a bulk copy operation, which doesn’t require a specific known type).

This doesn’t solve the problem of determining the count of parameters to the script, though. For that, my implementation performs a basic structural analysis phase of the script program instruction set. (The structural analysis phase — a subset of the NWScript Analyzer component — will also become critical to the JIT engine itself, as we’ll see later.)

The structural analysis operates as a static analysis step, wherein the flow of each call instruction in a script program is followed. This results in the discovery of several useful pieces of information:

  1. All subroutines in the compiled script program are discovered by walking through the instruction stream.
  2. The count of parameters and return values to each subroutine in the script program are discovered (i.e. the stack displacement of each subroutine across a call site).

The core of the structural analysis phase relies on the following properties of the NWScript instruction set:

  1. All references to the stack in the NWScript instruction set must specify a displacement that is a compile-time constant. In particular, this means that constructs such as the creation of a variable amount of locals within a subroutine cannot be directly encoded into the instruction set[1].
  2. All subroutines within the NWScript program follow the standard calling convention (except for #loader and #globals, which can be identified by virtue of always being the first subroutine invoked, or being the only subroutine to establish a global variables frame with the special purpose SAVEBP instruction). Notably, this means that if a subroutine is called with P parameter cells on the stack and R return cells on the stack, that on return from the subroutine, there will have been a write to a stack location R cells down from the top of the stack, and P cells will have been removed from the top of stack relative to subroutine entry.
  3. No subroutine in the script program is unconditionally recursive without any other exit path (such a subroutine would make little sense in practice, so this is seen as a reasonable limitation). This is important, because analyzing the structural (stack displacement) properties of a subroutine requires tracing through to the end of the subroutine in at least one flow path successfully.
  4. The structural analyzer must be provided with data on the stack displacement properties of all action service handlers. This is considered an acceptable limitation as the VM and the execution environment must be used in conjunction with action service handler implementations anyway. In this particular case, a program designed to perform a very basic parse of the prototypes of the action handlers exposed by NWN and NWN2 was used to generate the data tables for those games.

Given these properties, it becomes possible to construct an algorithm to discover the structural properties (that is, the location, parameter count, and return count) of each subroutine in the script program. The algorithm can be described roughly as follows (excluding #loader and #globals which can be special cased safely):

  1. At the start, set virtual SP=0.
  2. Walk one instruction at a time forward via disassembly.
  3. Then, depending on the current instruction…
    • If the current instruction references the stack at a greater displacement than the previous greatest displacement within this subroutine, record the new displacement.
    • If the current instruction modifies the stack, update the current virtual SP appropriately.
    • If the current instruction is a branch instruction, and we have not yet visited that branch, mark the branch as visited and queue a traversal of the branch (start at step #2 with the virtual SP at the time when the branch instruction was encountered). If the branch was unconditional and we had already visited the branch, switch to the next analysis queue entry.
  4. If the current instruction is an action service handler call (ACTION), update the virtual SP based on the action service handler data table.
  5. If the current instruction is a subroutine call…
    • If the subroutine has not been analyzed yet, then queue a new analysis run (at step #1) for the new subroutine, queue a new analysis run for the current call site address and mark it as blocking on the new subroutine (taking the current virtual SP), and then switch to processing the next unanalyzed branch that is not blocking on a subroutine. If there are no unanalyzed branches, we have detected a case of unconditional recursion and analysis fails.
    • If the subroutine has been analyzed already, then adjust the virtual SP accordingly.
  6. When a RET instruction is encountered, record the largest SP offset encountered and record the displacement of the virtual SP at the return point from 0. If virtual SP is 0, then there were no parameters. If virtual SP is -1 cell, then there was 1 cell of parameters, etc. Additionally, if the largest SP offset was greater than the displacement of the virtual SP from 0, then the remainder is the count of return cells for the subroutine. Finally, once all the above tasks are completed when the RET instruction is observed, switch to the next analysis queue entry, or if we have none left, analysis is complete.

This algorithm, repeated iteratively, discovers the structure of the script program. With the structure of the program discovered, the VM execution environment can fulfill the task of invoking parameterized scripts without requiring a source text compiler on-board with it (a notable improvement over the game’s stock execution environment).
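
To make the SP bookkeeping concrete, here is a heavily simplified sketch of the displacement arithmetic for a single flow path. The decoded-instruction abstraction is hypothetical, and the branch/call-site work queue and recursion detection described above are omitted entirely:

// Hypothetical, heavily simplified illustration of the virtual-SP arithmetic
// for one flow path; the real structural analyzer also maintains a work queue
// of branches and call sites and detects unconditional recursion.

#include <vector>

enum class InstrKind { StackRef, StackAdjust, Ret };

struct DecodedInstr
{
    InstrKind Kind;
    int       Cells;  // StackRef: depth below the top of stack, in cells.
                      // StackAdjust: net cells pushed (+) or popped (-).
};

struct SubroutineShape
{
    int ParameterCells;
    int ReturnCells;
};

// Walk one already-decoded flow path of a subroutine and derive its stack
// displacement properties, following the rules described in the list above.
SubroutineShape AnalyzeOnePath( const std::vector< DecodedInstr > & Path )
{
    int VirtualSp  = 0;  // Cells relative to SP at subroutine entry (negative = below entry).
    int DeepestRef = 0;  // Greatest displacement below the entry SP ever referenced.

    for (const DecodedInstr & Instr : Path)
    {
        switch (Instr.Kind)
        {
        case InstrKind::StackRef:
        {
            int DepthBelowEntry = Instr.Cells - VirtualSp;
            if (DepthBelowEntry > DeepestRef)
                DeepestRef = DepthBelowEntry;
            break;
        }

        case InstrKind::StackAdjust:
            VirtualSp += Instr.Cells;
            break;

        case InstrKind::Ret:
        {
            SubroutineShape Shape;
            Shape.ParameterCells = -VirtualSp;  // e.g. a virtual SP of -1 cell means 1 parameter cell.
            Shape.ReturnCells    = (DeepestRef > Shape.ParameterCells)
                                       ? DeepestRef - Shape.ParameterCells
                                       : 0;     // Deepest reference past the parameters is the return space.
            return Shape;
        }
        }
    }

    return SubroutineShape{ 0, 0 };  // No RET on this path (handled by the branch queue in practice).
}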

Next time, we’ll look more at the overall design of the NWScript analyzer and the IR that it produces. (If you are wanting to sneak an early look, Justin has a post up describing some more details about the analyzer.)

[1] It is possible to construct a program that accesses SP from a variable offset through creative use of the SAVEBP/RESTOREBP instructions in a hand-assembled script due to a quirk of the stock game VM, which saves the previous BP on the stack as an integer-type value. No such specimens are known to exist in the wild, and none of the known NWScript compiler implementations emit code to do so.

NWScript JIT engine: Subroutines, extension calls, and saved state

Thursday, August 5th, 2010

Previously, I discussed the fundamentals of how the NWScript instruction set operates, particularly with respect to the stack, the primary data store of the NWScript instruction set.

I left several topics out though, including subroutine calls, extension calls to APIs exposed by the script host (action service handler invocations, in NWScript parlance), and saved state (pseudo-coroutines). It’s worth discussing these before continuing further as the nature of how these constructs are implemented shapes the implementation of the JIT environment, in turn.

Subroutine calls

As with most programming languages, NWScript allows the programmer to define subroutines. NWScript’s support for subroutines is most akin to C-style functions (there are no classes in NWScript and thus no concept of member functions).

Unlike C, though, NWScript does not allow the control transfer address in the subroutine call instruction (JSR) to come from a non-constant source. Instead, the subroutine target must be hardcoded into the instruction operand (the instruction stream data space is not accessible to the instruction set itself).

One upshot of this is that, despite the fact that there is no structural listing of all subroutines in a compiled NWScript program, one can still deterministically identify such a listing by walking the instruction stream opcode-by-opcode.

Subroutines in NWScript are all invoked with a uniform calling convention, described as follows:

  1. The caller first allocates stack space for the return values (if any) of the callee. (NWScript subroutines have a single return value from a source level perspective, but a subroutine that returns an aggregate type is implemented by simply returning multiple values.) Return value storage is created on the stack from right to left, and is created typed based on the return types of the subroutine. (Canonically, every stack cell is strongly typed in the NWScript instruction set.)
  2. The caller then pushes the arguments (if any) to the callee in right-to-left order.
  3. The caller executes the subroutine call via a JSR (jump subroutine) instruction. The return program counter is stored in a separate, dedicated stack that is invisible to the NWScript program (the stack pointer is not modified).
  4. The callee completes its processing, and then copies any return values, if any existed, down the stack (to where the caller allocated return value space).
  5. The callee cleans the parameters off the stack, and then executes a RET instruction to return to the caller. If there was a return program counter on the invisible return stack, control is transferred to it, otherwise the script program ends (i.e. return from main).

The main points to keep in mind are that the caller allocates the return value and parameter slot space, and the return values are left on the stack after the call.
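
As a concrete illustration, the following toy simulation walks the stack traffic for a hypothetical int-returning subroutine that takes two int parameters. The vector of ints merely stands in for typed stack cells; none of this is actual engine code:

// Toy simulation of the NWScript subroutine calling convention for a
// hypothetical 'int AddTwo(int a, int b)'; each vector element is one cell.

#include <cassert>
#include <vector>

int main( )
{
    std::vector< int > Stack;

    // Caller: allocate the return value cell, then push arguments right to left.
    Stack.push_back( 0 );    // Return value slot (allocated by the caller).
    Stack.push_back( 20 );   // b
    Stack.push_back( 10 );   // a

    // Callee (after JSR): do the work, copy the result down to the caller's
    // return slot, then clean the parameters off the stack before RET.
    int Result = Stack[ Stack.size( ) - 1 ] + Stack[ Stack.size( ) - 2 ];
    Stack[ Stack.size( ) - 3 ] = Result;     // CPDOWNSP-style copy to the return slot.
    Stack.resize( Stack.size( ) - 2 );       // Remove the two parameter cells (MOVSP-style).

    // After RET: only the return value remains at the top of the stack.
    assert( Stack.size( ) == 1 && Stack.back( ) == 30 );
    return 0;
}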

The user’s entry point symbol (e.g. main) acts the same as any other subroutine from a calling convention perspective. However, there are some additional considerations when it comes to passing parameters to the entry point symbol, which we’ll discuss later.

Action service handler invocations (extension points)

In order for a script program to be particularly useful, it must have a means to interface with the outside world. This is accomplished via a set of action service handlers that a script host can expose to script programs.

Action service handlers can be logically thought of as an array of function pointers with a well-defined, uniform calling convention that can be invoked by the script VM on behalf of NWScript. As with subroutine calls, parameters and return values of action service handler calls are passed on the VM stack. There is a difference in the VM stack calling convention versus standard subroutine calls, however, in that the action service handler is responsible for pushing the return values (if any) on to the VM stack after it cleans the parameters. (Recall that for a standard subroutine call, the caller allocates space for the return value.)

In this respect, one can think of an action service handler call as simply an intrinsic, extended instruction that may be invoked.

To call an action service handler, a NWScript program uses the ACTION instruction, which includes several pieces of information:

  • An action ordinal, indicating which action service handler is to be invoked by the VM.
  • A parameter count, indicating how many (source-level) arguments have been passed to the action service routine. This is the count of arguments that the compiler recognized for the action handler at compile time and may not necessarily reflect the actual count of stack cells (for example, a ‘vector’ argument consumes three stack cells but increments the action parameter count by one).

Because there is no mechanism to perform structured linking of external procedure calls, the action service ordinal embedded into the instruction opcode indicates the extension function to invoke in the script host (similar to how a BIOS might expose functionality to real mode code via a specific interrupt number).
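
In other words, dispatching an ACTION instruction boils down to an array index. Here is a hedged sketch; the handler signature loosely mirrors the SCRIPT_ACTION prototype shown a bit further below, and all names are illustrative:

// Hypothetical dispatch of the ACTION instruction: the ordinal operand simply
// indexes a host-provided table of handlers. Names here are illustrative.

#include <cstddef>
#include <vector>

class NWScriptVM;     // Host VM object, as in the SCRIPT_ACTION prototype below.
class NWScriptStack;  // The VM stack passed to each handler.

typedef void (* ACTION_HANDLER)(
    NWScriptVM &    ScriptVM,
    NWScriptStack & VMStack,
    unsigned long   ActionId,
    std::size_t     NumArguments
    );

// Invoke the handler selected by an ACTION instruction's ordinal operand.
// Returning false for an out-of-range ordinal models an ill-formed script.
bool DispatchAction(
    const std::vector< ACTION_HANDLER > & HandlerTable,
    NWScriptVM &                          ScriptVM,
    NWScriptStack &                       VMStack,
    unsigned long                         ActionId,
    std::size_t                           NumArguments
    )
{
    if (ActionId >= HandlerTable.size( ))
        return false;

    HandlerTable[ ActionId ]( ScriptVM, VMStack, ActionId, NumArguments );
    return true;
}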

This design provides two mechanisms to allow the set of actions defined by a script host to be expanded without breaking already compiled script programs:

  1. Because actions are identified by ordinal, a script host can define a new action service handler and place it at the end of the action service handler array. Previously compiled script programs will still function, because prior action ordinals remain unchanged.
  2. The ‘parameter count’ operand embedded in the ACTION instruction allows new parameters to be added (defaulted) to the end of the parameter list for a pre-existing action. In this case, by inspecting the parameter count, an action service handler can detect which parameters should be set to a defaulted value (if an old compiled script is being executed), or which are actually on the VM stack (if a new compiled script is being executed).

Additionally, in this design, the script source text compiler and the script VM themselves do not need to have any specific, hardcoded knowledge of the action service handlers. (The source text compiler looks for a list of prototypes in a well-defined header file, and assumes that these prototypes are actual service handler definitions, with the first prototype being action ordinal zero, and so forth.) This means that the script system itself can be structured to be logically isolated from the actual extension API exposed by a script host, making the core script compilation and runtime environment easily portable to new API sets (i.e. new games).

A typical implementation of the script environment would expose an API for action service handlers to push and pop data values from the VM stack, leaving the actual parameter and return value manipulation up to the implementation of a particular action service handler.

For example, an action service handler named ‘AngleToVector’ that might be prototyped like so in NWScript source text:

// Convert fAngle to a vector
vector AngleToVector(float fAngle);

… might typically be implemented like so by a script host:

#define SCRIPT_ACTION( Name )                                  \
	void                                                       \
	NWScriptHost::OnAction_##Name(                             \
	    __in NWScriptVM & ScriptVM,                            \
	    __in NWScriptStack & VMStack,                          \
	    __in NWSCRIPT_ACTION ActionId,                         \
	    __in size_t NumArguments                               \
	    )                                                      
// ...
SCRIPT_ACTION( AngleToVector )
/*++

Routine Description:

	This script action converts an angle to a vector.

Arguments:

	fAngle - Supplies the angle.

Return Value:

	The routine returns the converted vector.

Environment:

	User mode.

--*/
{
	float        fAngle = VMStack.StackPopFloat( ) * (PI / 180.0f);
	NWN::Vector3 Heading;

	Heading.x = cos( fAngle );
	Heading.y = sin( fAngle );
	Heading.z = 0.0f;

	VMStack.StackPushVector( Heading );
}

In a properly structured script execution environment, all of the application-specific extension points, such as the ‘AngleToVector’ action service handler described above, can thus be cleanly separated out from the core execution environment.
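
Returning to the versioning point made earlier, the NumArguments value handed to each handler is what makes defaulted trailing parameters practical. The sketch below reuses the SCRIPT_ACTION pattern from above with a purely hypothetical action, and assumes a StackPopInt helper analogous to the StackPopFloat call shown earlier:

// Hypothetical handler demonstrating defaulted trailing parameters; the action
// name, its second parameter, and the StackPopInt helper are assumptions made
// for illustration, not part of the real NWN/NWN2 API surface.
SCRIPT_ACTION( PlayVisualEffect )
{
    int nEffectId = VMStack.StackPopInt( );
    int nMode     = 0;                       // Default applied for older compiled scripts.

    if (NumArguments >= 2)                   // Newer compiled scripts push the extra argument.
        nMode = VMStack.StackPopInt( );

    // ... perform the host-specific work with nEffectId and nMode,
    //     pushing any return value back onto VMStack ...
}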

Saved states

As I mentioned previously, NWScript is designed to implement an easy-to-program concept for deferred execution. From a source-level perspective, this is implemented by the concept of an ‘action’ argument type that may be passed to an action service handler (this type cannot be used for any purpose except in an action service handler prototype, as far as the source text compiler is concerned).

When the source text compiler encounters an ‘action’-typed value being referenced, it does not generate a typical stack data store instruction. Instead, it emits a special instruction, STORE_STATE, that directs the execution environment to save the current state of the script environment. For a typical VM-based implementation, this can very easily be accomplished by making a copy of the VM stack (which implicitly duplicates the entire data state of the script program).

The STORE_STATE instruction includes a (constant) operand that specifies a program counter to resume execution at. Using a saved VM execution state, the execution of the currently running script can be resumed at the STORE_STATE resume label. Because almost all of the state associated with the logical script invocation is present on the VM stack, resuming execution at a STORE_STATE resume label is very simple to implement for the VM-based execution environment.

From a NWScript source-text compiler perspective, any time that an ‘action’ type is passed to an action service handler, the source compiler allows an arbitrary statement that returns void to be substituted. (Typically, this might be a call to a subroutine that passes local or global variables on the argument list.)

When such a saved state construct is resumed, this arbitrary statement represents the STORE_STATE resume label, allowing a section of the script to be forked off for later execution (in a way that is conceptually similar in many respects to the standard POSIX ‘fork’ system call). The resume label has access to the global and local variables of the script program at the time when the script program’s state was saved, making maintaining state across a deferred execution context easy for the programmer.

From a script execution environment perspective, the execution environment would typically provide a special-purpose API to retrieve a snapshot of the most recently saved state (as created by a STORE_STATE instruction). This API would typically be used by action service handlers that accept an ‘action’-typed argument.
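
For a VM implementation, the saved state itself can be little more than a snapshot of the stack plus the resume address. The sketch below is illustrative only; the structure layout and the GetSavedState call are hypothetical names rather than the actual API:

// Hypothetical sketch of a VM-side saved state record; copying the VM stack
// implicitly captures the globals and locals live at the STORE_STATE point.

#include <memory>

class NWScriptStack;  // The VM stack type (as passed to action handlers above).

struct NWSavedState
{
    std::shared_ptr< NWScriptStack > StackSnapshot;  // Deep copy made at STORE_STATE time.
    unsigned long                    ResumePC;       // Resume label from the instruction operand.
    unsigned long                    ResumeBP;       // Saved global-frame (BP) value, if any.
};

// An action service handler that accepts an 'action'-typed argument would
// typically capture the pending state with something along these lines:
//
//     NWSavedState State = ScriptVM.GetSavedState( );   // Hypothetical API name.
//     DeferredQueue.Schedule( fSeconds, State );        // Resume later at State.ResumePC.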

To see how this paradigm can be used by a script host to provide a very easy to use way to delay execution, consider the following action service handler in NWN1/NWN2:

// Delay aActionToDelay by fSeconds.
void DelayCommand(float fSeconds, action aActionToDelay);

From a source level, it might be common to utilize this action service handler as follows:

DelayCommand( 1.0, DelayedFunctionCall( params ) );

The effect of such a source-level statement is that the ‘DelayedFunctionCall( params )’ section of logic is executed once the DelayCommand time expires (if we are speaking of the NWN1/NWN2 extension API set). Note that the script author is now freed of the complex logic to package up state for a timer callback or similar construct that other languages might often require; the deferred execution “just works” in a very intuitive manner.

NWScript JIT engine: A (more detailed) overview of NWScript

Tuesday, August 3rd, 2010

Last time, I outlined the basic scope of one of my recent side projects, a JIT engine for NWScript. Although I provided a brief, very high-level overview of the NWScript environment, it’s necessary to understand more details about the inner workings of the instruction set in order for various design decisions in the JIT project to make sense.

Based on the way the NWScript instruction set functions, we can derive the following probable design goals for the instruction set:

  • Extensibility. It should be easy to expose new APIs to the script environment without modifying the compiler or the core script VM logic itself. Furthermore, it should be possible to add new script APIs without breaking previously compiled scripts. Although BioWare’s game engines have evolved significantly since NWScript made its public debut in Neverwinter Nights 1 (“Aurora”), the script instruction set has seen comparatively minor revisions despite a wide range of hosting environments.
  • Portability. The compiled NWScript instruction set should be easily executable on any particular architecture (i.e. there should be no need to write a compiler for each platform the game is ported to, nor should one need to compile again for each platform supported). Neverwinter Nights 1 supported Win32 and Linux, and Dragon Age supports Win32 and Xbox360.
  • Safety (containment). End-users should be free to download custom content with custom script code without fear of the script code taking control over their machine.
  • Suspend / resume (coroutine-like) support. NWScript is designed to easily support timed deferral of various operations at well-defined sequence points in a way that is very natural to program, which is implemented by a coroutine-like saved state mechanism that should be simple to support in the execution environment.

With these design principles in mind, let’s dive in deeper as to how the script instruction set operates. The first major concept that one needs to understand about NWScript is that of the VM stack (or simply stack for short).

Note: For the curious, I have mirrored a copy of Tim Smith’s original NWScript instruction set reference here. There are several errata in Tim’s documentation that have since been identified, but it remains the definitive source on the subject. This posting series generally will not discuss or mention every single instruction in great detail, so the instruction set reference itself is optional reading.

NWScript VM stack

Due to its design as an intended interpretive-only scripting environment, the NWScript instruction set is purely stack-based. That is to say that the primary (and only) data store of a program authored in NWScript is the VM stack, which is an abstract data store that the execution environment provides. This is a common paradigm among other compiled, interpretive languages, designed for ease of use by the interpreter. Remember that NWScript was not originally designed to be executed natively, so any registers that the instruction language provided would most likely be handled by memory indirections anyway.

(To be pedantic, there are three pseudo-registers that a NWScript VM must keep track of — the current program counter (PC), the current top of stack pointer (SP), and the auxiliary stack pointer (BP). These pseudo-register values are generally only implicitly referenced by NWScript instructions in the form of relative address operands.)

The stack is laid out as an expand-up array of uniformly-sized stack cells. Each stack cell holds a single data element, and, as mentioned, all stack cells consume the same amount of space on the logical VM stack (4 addresses per stack cell). In practice, a program is ill-formed if it accesses the stack at a misaligned stack cell offset, so for all intents and purposes the stack cell can be considered the atomic data unit size on the VM stack.

(Why 4 addresses (‘bytes’) per stack cell, you might ask? This is actually an area where NWScript presents a somewhat leaky abstraction; the BioWare NWScript VM internally treats the VM stack as essentially a byte array within which either a raw data element or a pointer to that data element may be stored (the BioWare VM only supports 32-bit targets, making for 4 addresses consumed per stack cell). Nonetheless, all stack references in a program must be aligned to multiples of a size of a stack cell for the program to be considered well-formed.)

Nearly every instruction in the NWScript instruction set that takes a non-constant argument receives that argument from the VM stack (typically the top of the stack). Furthermore, because most instructions only act on the top of the stack, and there is no secondary data store for local variables (as in, say, MSIL), a NWScript program will often spend a non-trivial amount of time moving data around the VM stack.

For example, consider a NWScript subroutine that has two integer local variables, i and j. If this subroutine contains code structured along the lines of i = i + j;, one could expect to see the following general sequence of operations:

  1. Create a copy of j on the top of the stack (via CPTOPSP).
  2. Create a copy of i on the top of the stack (via CPTOPSP).
  3. Add the two variables together (via ADDII). This results in the two integer values at the top of the stack being popped off, and their sum being pushed on to the top of the stack.
  4. Copy the result back down to the permanent copy of i (via CPDOWNSP).
  5. Delete the temporary result of the addition from the top of the stack (via MOVSP).

While this logic is perfectly functional, it leaves plenty of room for improvement by a JIT system (four out of five instructions generated are merely artifacts of the stack-based instruction set versus performing the actual work of the operation).

Extraneous temporary copies could be removed (particularly if the language that the JIT backend translates to supports any sort of optimization of its own), and a number of type and validity checks on each stack reference that the interpretive VM must perform on most instructions could be dispensed with (having been enforced at JIT compilation time).

Data types

There are several fundamental data types recognized by the script instruction set. Each data type is logically stored by value on the stack, and each of these data types consumes a single stack cell:

  • int, the 32-bit signed integer.
  • float, the 32-bit (single precision) IEEE float.
  • object, a 32-bit unsigned quantity which cannot be directly modified by the core instruction set (except the creation of a constant value of type object with two well-known constant values). There are no built-in instructions that can operate on a value of type object (aside from constant creation and raw equality testing on the 32-bit unsigned bit pattern), but an extension API exposed to the script program by the script host may make use of object typed values. Generally, an object-typed value is used as a handle to reference a game entity (such as a door or a creature), but the script instruction set imposes no particular meaning on this type.
  • string, a counted 8-bit character string. Strings are logically stored by value on the stack (and as such, they are essentially immutable, in that any operation that ‘modifies’ a string simply creates a new string with the modified value).
  • vector, a set of three floating point (float) values. A vector is not actually a discrete type, but rather a triplet of three float values allocated adjacently on the stack. Several instructions have forms optimized to operate on this data construct, but a vector is otherwise simply created and manipulated as three float stack cells on the stack. (There is no hard binding of the three floats; rather, they are simply three float values created adjacent to one another.)

Additionally, the instruction set allows for a customizable set of up to 10 engine structure types. Engine structure types are opaque to the instruction set itself (and thus, by extension, the source text compiler). They are used to safely package up a data structure maintained by the script host for transmission across a script program authored in NWScript. An engine structure type allows the creation of an empty (or null) value, and supports equality testing by means of a well-defined callout to the script host to compare the equality of two engine structures of the same type.

Additionally, there is an internal data type that may be stored on the stack: stack pointer, which represents a saved value of the current stack pointer. This data type is generally not directly referenced by script code, and is used only during the creation of global variables. (My script VM, NWScriptVM, treats the stack pointer type as a unique type, whereas the BioWare VM technically treats it as an int. A well-formed script program should not read or write a value of this type directly, and it is used only as a side effect by certain instructions.)
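
Pulling the type list together, a VM that enforces per-cell typing might represent a stack cell along the lines of the sketch below. This is an illustrative layout only; it is neither the BioWare byte-array representation described earlier nor the actual internals of my stack implementation:

// Illustrative tagged stack cell for a type-enforcing VM; not the BioWare
// byte-array layout described above, and not a real implementation's layout.

#include <cstdint>
#include <string>

enum class CellType : std::uint8_t
{
    Int,
    Float,
    Object,
    String,
    EngineStructure,   // One of up to 10 opaque, host-defined types.
    StackPointer       // Internal type produced as a side effect of SAVEBP.
};

struct StackCell
{
    CellType Type;

    union
    {
        std::int32_t  Int;
        float         Float;
        std::uint32_t ObjectId;   // Opaque handle; only constants and equality tests in the core ISA.
        std::int32_t  SavedBp;    // Previous BP value saved on the stack by SAVEBP.
    } Value;

    std::string String;           // Used only when Type == CellType::String (strings are by-value).
    // Engine structures would be held via a host-managed handle, omitted here.
};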

Conspicuously missing from the list of data types available for storage is that of an instruction pointer. While NWScript supports the concept of subroutines and subroutine calls, the return instruction pointer is saved in a separate region that is inaccessible to the script program itself. Furthermore, all control transfers must target either specific extension points or hardwired subroutine addresses specified as operands to instructions (i.e. there is no concept of a virtual method call or a function pointer in NWScript).

Global variables

If you’ve been following along, you may have noticed a seemingly contradictory set of points. Namely, that I have made reference to the fact that a script program can support global variables, despite the fact that a script is essentially a completely unstructured instruction stream.

In the NWScript instruction set, global variables are referenced relative to an auxiliary stack pointer value (internally known as BP, or base pointer). This auxiliary stack pointer value generally points to the top of a call frame set up by a special compiler-emitted subroutine (internally called #globals) that runs before the main program entry point. The #globals subroutine creates data elements on its stack frame, and then uses a special-purpose pair of instructions (SAVEBP and RESTOREBP) in order to establish or de-establish a global variable frame, by assigning the current top of stack address to the active base pointer value.

Once #globals has finished preparing a script program’s global variables, it transfers control to the user’s entry point subroutine (i.e. main()).

Aggregates

Although the NWScript source text compiler supports the notion of aggregate types, the instruction set only has a very minimal awareness of aggregates at the instruction level.

All fields of an aggregate type are allocated adjacent on the stack. Operand forms are provided for several instructions to allow them to be repeated over a range of stack locations (to compare structures, for example), and several instructions have special support for performing built-in operations on the built-in vector structure (a tuple of three floats); for example, there is a derivative of the MUL instruction to support multiplying one float and a series of three other floats in a fashion consistent with vector/scalar multiplication.

Next time, we’ll take a closer look at how subroutine calls, script host extension point (action service handler) calls, and saved states operate.

Erratum for Tim Smith’s NWScript documentation

The following are known errata with respect to Tim Smith’s documentation on the NWScript instruction set:

  • SAVEBP and RESTOREBP are incorrectly documented as not modifying the current value of SP. The current BP value (on entry to SAVEBP) is pushed onto the stack by SAVEBP, and similarly restored from the current top of stack by RESTOREBP.
  • Several legal opcode + type opcode combinations are omitted from the documentation.

Side projects: NWScript JIT engine (introduction)

Monday, August 2nd, 2010

As an interesting experiment in program analysis, one of the side projects that I’ve been working on in my spare time is an execution environment for NWScript, the scripting language for Neverwinter Nights / Neverwinter Nights 2 and the Dragon Age series of games. (I happen to have spent some time analyzing NWN/NWN2 recently, hence this project.)

NWScript 101

From a source-level perspective, NWScript provides a C-like programming environment (there are no native arrays or pointers in the NWN/NWN2 iterations of NWScript). Like C, NWScript source text must be compiled to an object-code form in order for it to be consumed by the execution environment.

As a result, only the source text compiler for NWScript must deal with the program source code directly; the actual execution environment only deals with the NWScript object code. I’ll go into more details as to how the object code is structured in a future posting, but here’s a brief overview necessary to understand the scope of the project:

The NWScript object code format is essentially simply a raw instruction stream (think an old-style .com file), consisting of a series of NWScript instructions. The instruction set (in NWN/NWN2) is a stack-based, strongly-typed, small-complexity set of 45 distinct, variable-length instructions. (Some instructions perform several overloaded functions based on their operand types.) All control transfers in NWScript must be performed to fixed locations hardwired at compile time.

These properties combine to make the NWScript object code format readily analyzable (given sufficient effort), despite the fact that it is essentially provided in a completely unstructured, raw instruction-stream based form.

Project scope

The execution environment that I set out to create involves several different components:

  • A NWScript VM, capable of interpretive execution of the NWScript object code format. (The BioWare games in question also use an interpretive VM style of execution environment.) While potentially slow(er), the NWScript VM supports even atypical, hand-built NWScript programs (as opposed to those emitted by a compiler).
  • A NWScript analyzer, capable of inspecting a NWScript program in object code form and performing several levels of analysis, up to producing a high-level, full-featured IR (or intermediate representation) describing the NWScript program’s behavior. The IR raising phase implemented by the NWScript analyzer is agnostic to any particular following backend phase, and it provides an opportunity to optimize the generated IR while it is in a backend-neutral form.
  • A NWScript JIT backend, capable of translating the high-level NWScript IR into machine-level instructions for native execution. The JIT backend is intended to be essentially a simple translator to a JIT environment that consumes NWScript IR (generated by the analysis phase), and ultimately emits native code. For this project, the first NWScript JIT backend that I have created emits MSIL that can be compiled to native code by the .NET JIT environment.

Out of scope for this particular posting series is the concept of a NWScript host, which is a program that invokes the NWScript execution environment in order to run a script. The NWScript host provides a series of extension points (action service handlers) that may be called by the script program in order to interface with the outside world. An example of a NWScript host might be the NWN1/NWN2 game servers.

An advanced NWScript host might utilize both the NWScript VM and the NWScript JIT system provided by this project (for example, scripts could be interpretively executed by the script VM until background JIT compilation finishes, after which subsequent execution could switch over to the JIT’d version of the script program).
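
As a rough sketch of that idea (hypothetical names only; this is not the project’s actual interface), the host-facing dispatch might look something like the following, with the interpreter and the background JIT assumed to exist elsewhere:

typedef int (*JITTED_ENTRYPOINT)(void *HostContext);

typedef struct _SCRIPT_PROGRAM
{
 JITTED_ENTRYPOINT volatile JitEntry;     // Published once background JIT compilation finishes.
 const unsigned char       *ObjectCode;
 unsigned long              ObjectCodeLength;
} SCRIPT_PROGRAM;

//
// Assumed to exist elsewhere in the execution environment.
//

int
VmInterpret(
 const unsigned char *ObjectCode,
 unsigned long ObjectCodeLength,
 void *HostContext
 );

int
RunScript(
 SCRIPT_PROGRAM *Program,
 void *HostContext
 )
{
 JITTED_ENTRYPOINT Entry = Program->JitEntry;

 //
 // If the background JIT has published a native entry point, use it;
 // otherwise fall back to interpretive execution. (A real implementation
 // would publish the pointer with proper interlocked/memory-barrier
 // semantics.)
 //

 if (Entry != NULL)
  return Entry( HostContext );

 return VmInterpret( Program->ObjectCode, Program->ObjectCodeLength, HostContext );
}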

Next time, I’ll go into some more gory details on the nature of the NWScript object code.

Credits

Tim Smith (Torlack), who now works at BioWare, documented the NWScript instruction set several years ago, and Justin Olbrantz contributed much of the code to the IR-raising portion of this project.

Why hooking system services is more difficult (and dangerous) than it looks

Saturday, July 26th, 2008

System service hooking (and kernel mode hooking in general) is one of those subjects that is near and dear to me, and one about which I’ve got mixed feelings. (System service hooking refers to intercepting system calls, like NtCreateFile, in kernel mode using a custom driver.)

On one hand, hooking things like system services can be extremely tricky at best, and outright dangerous at worst. There are a number of ways to write code that is almost right, but will fail in rare edge cases. Furthermore, there are even more ways to write code that will behave correctly as long as system service callers “play nicely” and do the right thing, but sneaks in security holes that only become visible once somebody in user mode tries to “bend the rules” a bit.

On the other hand, though, there are certainly some interesting things that one can do in kernel mode, for which there really aren’t any supported or well-defined extensibility interfaces without “getting one’s hands dirty” with a little hooking, here or there.

Microsoft’s policy on hooking in kernel mode (whether patching system services or otherwise) is pretty clear: They would very much rather that nobody even thought about doing anything remotely like code patching or hooking. Having seen virtually every anti-virus product under the sun employ hooking in one way or another at some point during their lifetime (and subsequently fail to do it correctly, usually with catastrophic consequences), I can certainly understand where they are coming from. Then again, I have also been on the other side of the fence once or twice, having been blocked from implementing a particular useful new capability due to a certain anti-patching system on recent Windows versions that shall remain nameless, at least for this posting.

Additionally, in further defense of Microsoft’s position, the vast majority of code I have seen in the field that has performed some sort of kernel mode hooking has done so out of a lack of understanding of the available and supported kernel mode extensibility interfaces. There are, furthermore, a number of cases where hooking system services won’t even get the desired result (even if it weren’t otherwise fraught with problems); for instance, attempting to catch file I/O by hooking NtReadFile/NtReadFileScatter/NtWriteFile/NtWriteFileGather will completely miss any I/O that is performed through a mapped section view.
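
To illustrate the mapped-section case from user mode, the following reads file data without ever issuing NtReadFile on its own behalf, so a hook on the read/write services never sees the access (error handling omitted for brevity):

#include <windows.h>
#include <stdio.h>

int
main(
 void
 )
{
 HANDLE File;
 HANDLE Section;
 const unsigned char *View;

 File    = CreateFileW( L"C:\\Windows\\win.ini", GENERIC_READ, FILE_SHARE_READ,
                        NULL, OPEN_EXISTING, 0, NULL );
 Section = CreateFileMappingW( File, NULL, PAGE_READONLY, 0, 0, NULL );
 View    = (const unsigned char *)MapViewOfFile( Section, FILE_MAP_READ, 0, 0, 0 );

 //
 // The data is faulted in by the memory manager on first access; no
 // NtReadFile call is made on behalf of this process for these bytes.
 //
 printf( "First byte: 0x%02x\n", View[ 0 ] );

 UnmapViewOfFile( View );
 CloseHandle( Section );
 CloseHandle( File );

 return 0;
}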

Nonetheless, system service hooking seems to be an ever-increasing trend, one that doesn’t seem to be likely to go away any time soon (at least for 32-bit Windows). There are also, of course, bits of malicious code out there which attempt to do system service hooking as well, though in my experience, once again, the worst (and most common) offenders have been off-the-shelf software that was likely written by someone who didn’t really understand the consequences of what they were doing.

Rather than weigh in on the side of “Microsoft should allow all kernel mode hooking” or “All code hooking is bad”, however, I thought that a different viewpoint might be worth considering on this subject. At the risk of incurring Don Burn’s wrath by even mentioning the subject (an occurrence that is quite common on the NTDEV list, if one happens to follow it), I figured that it might be instructive to provide an example of just how easy it is to slip up in a seemingly innocent way while writing a system service hook. Of course, “innocent slip-ups” in kernel mode often equate to bugchecks at best, or security holes at worst.

Thus, the following, an example of how a perfectly honest attempt at hooking a system service can go horribly wrong. This is not intended to be a tutorial in how to hook system services, but rather an illustration of how one can easily create all sorts of nasty bugs even while trying to be careful.

The following code assumes that the system service in question (NtCreateSection) has had a hook “safely” installed (BuggyDriverHookNtCreateSection). This posting doesn’t even touch on the many and varied problems of safely trying to hook (or even worse, unhook) a system service, which are also typically done wrong in my experience. Even discounting the diciness of those problems, there’s plenty that can go wrong here.

I’ll post a discussion of some of the worst bugs in this code later on, after people have had a chance to mull it over for a little while. This routine is a fairly standard attempt to post-process a system service by doing some additional work after it completes. Feel free to post a comment if you think you have found one of the problems (there are several). Bonus points if you can identify why something is broken and not just that it is broken (e.g. provide a scenario where the existing code breaks). Even more bonus points if you can explain how to fix the problem you have found without introducing yet another problem or otherwise making things worse – my aim in asking that is to show that there is a great deal of subtlety at play here, rather than simply to show how to operate an NtCreateSection hook correctly. Oh, and there’s certainly more than one bug here to be found, as well.

N.B. I haven’t run any of this through a compiler, so excuse any syntax errors that I missed.

Without further ado, here’s the code:

//
// Note: This code has bugs. Please don't actually try this at home!
//

NTSTATUS
NTAPI
BuggyDriverNtCreateSection(
 OUT PHANDLE SectionHandle,
 IN ACCESS_MASK DesiredAccess,
 IN POBJECT_ATTRIBUTES ObjectAttributes,
 IN PLARGE_INTEGER SectionSize OPTIONAL,
 IN ULONG Protect,
 IN ULONG Attributes,
 IN HANDLE FileHandle
 )
{
 NTSTATUS Status;
 HANDLE Section;
 SECTION_IMAGE_INFORMATION ImageInfo;
 ULONG ReturnLength;
 PVOID SectionObject;
 PVOID FileObject;
 BOOLEAN SavedSectionObject;
 HANDLE SectionKernelHandle;
 LARGE_INTEGER LocalSectionSize;

 SavedSectionObject = FALSE;

 //
 // Let's call the original NtCreateSection, as we only care about successful
 // calls with valid parameters.
 //

 Status = RealNtCreateSection(
  SectionHandle,
  DesiredAccess,
  ObjectAttributes,
  SectionSize,
  Protect,
  Attributes,
  FileHandle
  );

 //
 // Failed? We'll bail out now.
 //

 if (!NT_SUCCESS( Status ))
  return Status;

 //
 // Okay, we've got a successful call made, let's do our work.
 // First, capture the returned section handle. Note that we do not need to
 // do a probe as that was already done by NtCreateSection, but we still do
 // need to use SEH.
 //

 __try
 {
  Section = *SectionHandle;
 }
 __except( EXCEPTION_EXECUTE_HANDLER )
 {
  Status = (NTSTATUS)GetExceptionCode();
 }

 //
 // The user unmapped our buffer, let's bail out.
 //

 if (!NT_SUCCESS( Status ))
  return Status;

 //
 // We need a pointer to the section object for our work. Let's grab it now.
 //

 Status = ObReferenceObjectByHandle(
  Section,
  0,
  NULL,
  KernelMode,
  &SectionObject,
  NULL
  );

 if (!NT_SUCCESS( Status ))
  return Status;

 //
 // Just for fun, let's check if the section was an image section, and if so,
 // we'll do special work there.
 //

 Status = ZwQuerySection(
  Section,
  SectionImageInformation,
  &ImageInfo,
  sizeof( SECTION_IMAGE_INFORMATION ),
  &ReturnLength
  );

 //
 // If we are an image section, then let's save away a pointer to the
 // section object for our own use later.
 //

 
 if (NT_SUCCESS(Status))
 {
  //
  // Save pointer away for something that we might do with it later. We might
  // want to care about the section image information for some unspecified
  // reason, so we will copy that and save it in our tracking list. For
  // example, maybe we want to map a view of the section into the initial
  // system process from a worker thread.
  //

  Status = SaveImageSectionObjectInList(
   SectionObject,
   &ImageInfo
   );

  if (!NT_SUCCESS( Status ))
  {
   ObDereferenceObject( SectionObject );
   return Status;
  }

  SavedSectionObject = TRUE;
 }

 //
 // Let's also grab a kernel handle for the file object so that we can do some
 // sort of work with it later on.
 //

 Status = ObReferenceObjectByHandle(
  Section,
  0,
  NULL,
  KernelMode,
  &FileObject,
  NULL
  );

 if (!NT_SUCCESS( Status ))
 {
  if (SavedSectionObject)
   DeleteImageSectionObjectInList( SectionObject );

  ObDereferenceObject( SectionObject );

  return Status;
 }

 //
 // Save the file object away, as well as maximum size of the section object.
 // We need the size of the section object for a length check when accessing
 // the section later.
 //

 if (SectionSize)
 {
  __try
  {
   LocalSectionSize = *SectionSize;
  }
  __except( EXCEPTION_EXECUTE_HANDLER )
  {
   Status = (NTSTATUS)GetExceptionCode();

   ObDereferenceObject( FileObject );

   if (SavedSectionObject)
    DeleteImageSectionObjectInList( SectionObject );

   ObDereferenceObject( SectionObject );

   return Status;
  }
 }
 else
 {
  //
  // Ask the file object for its length; this could be done by any of the
  // usual means of doing so.
  //

  Status = QueryAllocationLengthFromFileObject(
   FileObject,
   &LocalSectionSize
   );

  if (!NT_SUCCESS( Status ))
  {
   ObDereferenceObject( FileObject );

   if (SavedSectionObject)
    DeleteImageSectionObjectInList( SectionObject );

   ObDereferenceObject( SectionObject );

   return Status;
  }
 }

 //
 // Save the file object + section object + section length away for future
 // reference.
 //

 Status = SaveSectionFileInfoInList(
  FileObject,
  SectionObject,
  &LocalSectionSize
  );

 if (!NT_SUCCESS( Status ))
 {
  ObDereferenceObject( FileObject );

  if (SavedSectionObject)
   DeleteImageSectionObjectInList( SectionObject );

  ObDereferenceObject( SectionObject );

  return Status;
 }

 //
 // All done. Lose our references now. Assume that the Save*InList routines
 // took their own references on the objects in question. Return to the caller
 // successfully.
 //

 ObDereferenceObject( FileObject );
 ObDereferenceObject( SectionObject );

 return STATUS_SUCCESS;
}

How to not write code for a mobile device

Monday, December 10th, 2007

Earlier this week, I got a shiny, brand-new XV6800 to replace my aging (and rather limited in terms of running user-created programs) BREW-ified phone.

After setting up ActiveSync, IPsec, and all of the other usual required configuration settings, the next step was, of course, to install the baseline minimum set of applications on the device to get by. Close to the top of this list is an SSH client (being able to project arbitrary console programs over SSH and screen to the device is, at the very least, a workable enough solution in the interim for programs that are not feasible to run directly on the device, such as my SILC client). I’ve found PuTTY to be a fairly acceptable client for “big Windows”, and there just so happened to be a Windows Mobile port of it. Great, or so I thought…

While PocketPuTTY does technically work, I noticed a couple of deficiencies with it pretty quickly:

  1. The page up / page down keys are treated the same as the up arrow / down arrow keys when sent to the remote system. This sucks, because I can’t access my scrollback buffer in SILC easily without page up / page down. Other applications can tell the difference between the two keys (obviously, or there wouldn’t be much of a point to having the keys at all), so this one is clearly a PocketPuTTY bug.
  2. There is no way to restart a session once it is closed, or at least, not that I’ve found. Normally, on “big Windows”, there’s a “Restart Session” menu option in the window menu of the PuTTY session, but (as far as I can tell) there’s no such equivalent to the window menu on PocketPuTTY. There is a “Tools” menu, although it has some rather useless menu items (e.g. About) instead of actually useful ones, like “Restart Session”.
  3. Running PocketPuTTY seems to have a significantly negative effect on battery life. This is really unfortunate for me, since the expected use is to leave an SSH session to a terminal running screen for a long period of time. (Note that this was resolved, partially, by locating a slightly more maintained copy of PocketPuTTY.)
  4. SSH port forward support seems to be fairly broken, in that as soon as a socket is cleaned up, all receives in the process break until you restart it. This is annoying, but workable if one can go without SSH port forwards.

Most of these problems are not all that difficult to fix, and since the source code is available, I’m planning to try my hand at doing just that; I expect this is an app that I’ll make fairly heavy use of.

The battery life problem (the third item above) is the one I really want to call attention to among these deficiencies, however. My intent here is not to bash the PocketPuTTY folks (and I’m certainly happy that they’ve at least gotten a (semi-)working port of PuTTY with source code out, so that other people can improve on it from there), but rather to point out some things that really should not be done when you’re writing code that is intended to run on a mobile device (especially if the code is intended to run exclusively on a mobile device).

On a portable device, one of the things that most users really expect is long battery life. Though this particular point certainly holds true for laptops as well, it is arguably even more important for converged mobile phone devices. After all, most people consider their phone an “always on” contact mechanism, and unexpectedly running out of battery life is extremely annoying in this respect. Furthermore, if your mobile phone has the capability to run all sorts of useful programs on it, but doing so eats up your battery in a couple of hours, then there is really not that much point in having that capability at all, is there?

Returning to PocketPuTTY, one of the main problems (at least with the version I initially used) was, again, that PocketPuTTY would reduce battery life significantly. Looking around in the source code for possible causes, I noticed the following gem, which turned out to be the core of the network read loop for the program.
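
(The offending snippet itself isn’t reproduced here; the following is only a representative sketch, with hypothetical names, of the shape of the loop in question.)

#include <winsock2.h>
#include <windows.h>

void HandleReceivedData(const char *Data, int Length); // Hypothetical consumer, assumed to exist elsewhere.

static
void
NetworkReadLoop(
 SOCKET NetworkSocket   // Assumed to be a non-blocking socket.
 )
{
 char Buffer[ 4096 ];

 for (;;)
 {
  int Received = recv( NetworkSocket, Buffer, (int)sizeof( Buffer ), 0 );

  if (Received > 0)
   HandleReceivedData( Buffer, Received );

  //
  // Nothing to do? Nap for 1ms and poll again, forever, rather than
  // blocking until data (or another event) actually arrives.
  //
  Sleep( 1 );
 }
}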

Yes, there really is a Sleep(1) spin loop in there, in a software port that is designed to run on battery powered devices. For starters, that blows any sort of processor power management completely out of the water. Mobile devices have a lot of different components vying for power, and the easiest (and most effective) way to save on power (and thus battery life) is to not turn those components on. Of course, it becomes difficult to do that for power hungry components like the 400MHz CPU in my XV6800 if there’s a program that has an always-ready-to-run thread…

Fortunately, there happened to be a newer revision of PocketPuTTY floating about with the issue fixed (although getting ahold of the source code for that version proved to be slightly more difficult, as the original maintainer of the project seems to have disappeared). I did eventually manage to get into contact with someone who had been maintaining their own set of improvements and grab a not-so-crusty source tree from them to do my own improvements, primarily for the purpose of fixing some of the annoyances I mentioned previously (thus beginning my initial foray into Windows CE development).

Most Win32 applications are really multithreaded, even if they don’t know it

Monday, November 12th, 2007

Most Win32 programs out there operate with more than one thread in the process at least some of the time, even if the program code doesn’t explicitly create a thread. The reason for this is that a number of system APIs internally create threads for their purposes.

For example, console Ctrl-C/Ctrl-Break events are handled in a new thread by default. (Internally, CSRSS creates a new thread in the address space of the recipient process, which then calls a kernel32 function to invoke the active console handler list. This is one of the reasons why kernel32 is guaranteed to be at the same base address system wide, incidentally, as CSRSS caches the address of the Ctrl-C dispatcher in kernel32 and assumes that it will be valid across all client processes[1].) This can introduce potentially unexpected threading concerns depending on what a console control handler does. For example, if a console control handler writes to the console, this access will not necessarily be synchronized with any I/O the initial thread is doing with the console, unless some form of explicit synchronization is used.
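
A minimal sketch of that scenario: the handler below runs on the CSRSS-injected thread, so its console output has to be explicitly synchronized with whatever the initial thread is printing.

#include <windows.h>
#include <stdio.h>

static CRITICAL_SECTION ConsoleLock;

static
BOOL
WINAPI
CtrlHandler(
 DWORD CtrlType
 )
{
 if (CtrlType != CTRL_C_EVENT)
  return FALSE;

 //
 // This runs on a different thread than main(), even though this program
 // never explicitly created one.
 //
 EnterCriticalSection( &ConsoleLock );
 printf( "Ctrl-C received\n" );
 LeaveCriticalSection( &ConsoleLock );

 return TRUE;
}

int
main(
 void
 )
{
 InitializeCriticalSection( &ConsoleLock );
 SetConsoleCtrlHandler( CtrlHandler, TRUE );

 for (;;)
 {
  EnterCriticalSection( &ConsoleLock );
  printf( "Working...\n" );
  LeaveCriticalSection( &ConsoleLock );

  Sleep( 1000 );
 }
}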

This is one of the reasons why Visual Studio 2005 eschews the single threaded CRT entirely, instead favoring the multithreaded variants. At this point in the game, there is comparatively little benefit to providing a single threaded CRT that will break in strange ways (such as if a console control handler is registered), just to save a minuscule amount of locking overhead. In a program that is primarily single threaded, the critical section package is fast enough that any overhead incurred by the CRT acquiring locks is negligible compared to the actual processing cost of whatever operation is being performed.

An even greater number of Win32 APIs use worker threads internally, though these typically have limited visibility to the main process image. For example, WSAAsyncGetHostByName uses a worker thread to synchronously call gethostbyname and then post the results to the requester via a window message. Even debugging an application typically results in threads being created in the target process for the purpose of debugger break-in processing.

Due to the general trend that most Win32 programs will encounter multiple threads in one fashion or another, other parts of the Win32 programming environment besides the CRT are gradually dropping the last vestiges of their support for purely single threaded operation as well. For instance, the low fragmentation heap (LFH) does not support HEAP_NO_SERIALIZE, which disables the usage of the heap’s built-in critical section. Here, as with the CRT, the cases where it is always provably safe to go without locks entirely are so limited, and provide such a small benefit, that it’s just not worth the trouble to take out the call to acquire the internal critical section. (Entering an unowned critical section does not involve a kernel mode transition and is quite cheap. In a primarily single threaded application, any critical sections will by definition almost always be either unowned or owned by the current thread in a recursive call.)
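
A quick user-mode illustration of the LFH point (standard APIs; note that on recent Windows versions the LFH is typically enabled by default for ordinary heaps anyway):

#include <windows.h>
#include <stdio.h>

int
main(
 void
 )
{
 ULONG  LfhValue = 2;  // 2 requests the low fragmentation heap via HeapSetInformation.
 HANDLE SerializedHeap;
 HANDLE UnserializedHeap;
 BOOL   Ok1, Ok2;

 SerializedHeap   = HeapCreate( 0, 0, 0 );
 UnserializedHeap = HeapCreate( HEAP_NO_SERIALIZE, 0, 0 );

 //
 // A normally-serialized heap can be switched to the LFH...
 //
 Ok1 = HeapSetInformation( SerializedHeap, HeapCompatibilityInformation,
                           &LfhValue, sizeof( LfhValue ) );

 //
 // ...but, per the documented restriction, the LFH cannot be enabled on a
 // heap created with HEAP_NO_SERIALIZE, so this request should be refused.
 //
 Ok2 = HeapSetInformation( UnserializedHeap, HeapCompatibilityInformation,
                           &LfhValue, sizeof( LfhValue ) );

 printf( "serialized heap -> LFH: %d, HEAP_NO_SERIALIZE heap -> LFH: %d\n", Ok1, Ok2 );

 HeapDestroy( SerializedHeap );
 HeapDestroy( UnserializedHeap );

 return 0;
}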

[1] This is not strictly true in some cases with the advent of Wow64, but that is a topic for another day.

Most data references in x64 are RIP-relative

Monday, November 5th, 2007

One of the larger (but often overlooked) changes to x64 with respect to x86 is that most instructions that previously only referenced data via absolute addressing can now reference data via RIP-relative addressing.

RIP-relative addressing is a mode where an address reference is provided as a (signed) 32-bit displacement from the current instruction pointer. While this was typically only used on x86 for control transfer instructions (call, jmp, and so forth), x64 expands the use of instruction pointer relative addressing to cover a much larger set of instructions.

What’s the advantage of using RIP-relative addressing? Well, the main benefit is that it becomes much easier to generate position independent code, or code that does not depend on where it is loaded in memory. This is especially useful in today’s world of (relatively) self-contained modules (such as DLLs or EXEs) that contain both data (global variables) and the code that goes along with it. With flat addressing on x86, references to global variables typically required hardcoding the absolute address of the global in question, assuming the module loads at its preferred base address. If the module then could not be loaded at its preferred base address at runtime, the loader had to perform a set of base relocations that essentially rewrite all instructions with an absolute address operand component to take into account the new address of the module.

The loader cannot figure out on its own which instructions would need to be rewritten in such a fashion; instead, it requires assistance from the compiler and linker (in the form of the base relocation section of a PE image, on Windows) to provide it with a list of addresses that correspond to instruction operands that need to be modified to reflect the new image base after an image has been relocated.

An instruction that uses RIP-relative addressing, however, typically does not require any base relocations (otherwise known as “fixups”) at load time if the module containing it is relocated. This is because, as long as portions of the module are not internally re-arranged in memory (something not supported by the PE format), any address reference that is both relative to the current instruction pointer and refers to a location within the confines of the current image will continue to refer to the correct location, no matter where the image is placed at load time.
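
For example, given a simple global access in C, typical code generation looks something like the following (the comments show the general shape of the emitted instructions; exact instruction selection and syntax vary by compiler and settings):

static int GlobalCounter;

int
IncrementCounter(
 void
 )
{
 //
 // x86 (32-bit), typically an absolute reference:
 //
 //   inc dword ptr [_GlobalCounter]    ; absolute VA baked into the operand;
 //                                     ; needs a base relocation if the
 //                                     ; image is rebased
 //
 // x64, typically RIP-relative:
 //
 //   inc dword ptr [rip+GlobalCounter] ; signed 32-bit displacement from the
 //                                     ; instruction pointer; no fixup needed
 //                                     ; when the image is rebased
 //
 return ++GlobalCounter;
}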

As a result, many x64 images have a greatly reduced number of fixups, due to the fact that most operations can be performed in an RIP-relative fashion. For example, the base relocation information (not including alignment padding) on the 64-bit ntdll.dll (for Windows Vista) is a mere 560 bytes total, compared to 18092 bytes in the Wow64 (x86) version.

Fewer fixups also mean better memory usage when a binary is relocated, as there is a higher probability that any particular page will not need to be modified by the base relocation process, and thus can remain shared even if a given process needs to relocate the DLL.