Archive for the ‘Reverse Engineering’ Category

Hotpatching MS08-067

Friday, October 24th, 2008

If you have been watching the Microsoft security bulletins lately, then you’ve likely noticed yesterday’s bulletin, MS08-067. This is a particularly nasty bug, as it doesn’t require authentication to exploit in the default configuration for Windows Server 2003 and earlier systems (assuming that an attacker can talk over port 139 or port 445 to your box).

The usual mitigation for this particular vulnerability is to block off TCP/139 (NetBIOS) and TCP/445 (Direct hosted SMB), thus cutting off remote access to the srvsvc pipe, a prerequisite for exploiting the vulnerability in question. In my case, however, I had a box that I really didn’t want to reboot immediately. In addition, for the box in question, I did not wish to leave SMB blocked off remotely.

Given that I didn't want to assume that there'd be no possible way for an untrusted user to establish a TCP/139 or TCP/445 connection, this left me with limited options: either I could simply hope that the box wouldn't be compromised before it became convenient to reboot, or I could see if I could come up with some form of alternative mitigation on my own. After all, a debugger is the software development equivalent of a swiss army knife and duct-tape; I figured that it would be worth seeing if I could cobble together some sort of mitigation by manually patching the vulnerable netapi32.dll. To do this, however, it would be necessary to learn enough about the flaw in question to discern what the fix was, in the hope of creating some form of alternative countermeasure for the vulnerability.

The first stop for gaining more information about the bug was the Microsoft advisory. As usual, however, the bulletin released for MS08-067 lacked the technical detail needed to fully understand the flaw, down to the level of which functions were patched; all it confirmed was that the vulnerability resided somewhere in netapi32.dll. (The Microsoft rationale behind this policy is that providing that level of technical detail would simply aid the creation of exploits.) However, as Pusscat presented at Blue Hat Fall '07, reverse engineering most present-day Microsoft security patches is not particularly difficult.

The usual approach to the patch reverse engineering process is to use a program called bindiff (an IDA plugin) that analyzes two binaries in order to discover the differences between the two. In my case, however, I didn’t have a copy of bindiff handy (it’s fairly pricey). Fortunately (or unfortunately, depending on your persuasion), there already existed a public exploit for this bug, as well as some limited public information from persons who had already reverse engineered the patch to a degree. To this end, I had a particular function in the affected module (netapi32!NetpwPathCanonicalize) which I knew was related to the vulnerability in some form.

At this point, I brought up a copy of the unpatched netapi32.dll in IDA, as well as a patched copy of netapi32.dll, then started looking through and comparing disassembly one page at a time until an interesting difference popped up in a subfunction of netapi32!NetpwPathCanonicalize:

Unpatched code:

.text:000007FF7737AF90 movzx   eax, word ptr [rcx]
.text:000007FF7737AF93 xor     r10d, r10d
.text:000007FF7737AF96 xor     r9d, r9d
.text:000007FF7737AF99 cmp     ax, 5Ch
.text:000007FF7737AF9D mov     r8, rcx
.text:000007FF7737AFA0 jz      loc_7FF7737515E

Patched code:

.text:000007FF7737AFA0 mov     r8, rcx

.text:000007FF7737AFA3 xor     eax, eax

.text:000007FF7737AFA5 mov     [rsp+arg_10], rbx
.text:000007FF7737AFAA mov     [rsp+arg_18], rdi
.text:000007FF7737AFAF jmp     loc_7FF7738E5D6

.text:000007FF7738E5D6 mov     rcx, 0FFFFFFFFFFFFFFFFh
.text:000007FF7738E5E0 mov     rdi, r8
.text:000007FF7738E5E3 repne scasw

.text:000007FF7738E5E6 movzx   eax, word ptr [r8]
.text:000007FF7738E5EA xor     r11d, r11d

.text:000007FF7738E5ED not     rcx

.text:000007FF7738E5F0 xor     r10d, r10d

.text:000007FF7738E5F3 dec     rcx

.text:000007FF7738E5F6 cmp     ax, 5Ch

.text:000007FF7738E5FA lea     rbx, [r8+rcx*2+2]

.text:000007FF7738E5FF jnz     loc_7FF7737AFB4

Now, without even really understanding what's going on in the function as a whole, it's pretty obvious that this is where (at least one) modification was made; the new code adds an inline wcslen call. Typically, security fixes for buffer overrun conditions involve adding previously missing boundary checks, so a new call to a string-length function such as wcslen is a fairly reliable indicator that one has found the site of the fix for the vulnerability in question.

(The repne scasw instruction scans a memory region two bytes at a time until a particular value (in ax) is found, or the maximum count (in rcx, typically initialized to (size_t)-1) is exhausted. Since we're scanning two bytes at a time, and we've zeroed eax, we're looking for a 0x0000 value in a string of two-byte quantities; in other words, an array of WCHARs (a null-terminated Unicode string). The resulting value in rcx after the repne scasw can be used to derive the length of the string, as it will have been decremented once for each WCHAR encountered up to and including the terminating 0x0000 WCHAR; the subsequent not rcx and dec rcx turn this into the string length.)
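
In C terms, the added code computes nothing more exotic than a string length. A minimal sketch of the equivalent logic (the function name is mine; the real code is, of course, emitted inline):

#include <stddef.h>

/*
 * What the mov rcx, -1 / repne scasw / not rcx / dec rcx sequence
 * computes, expressed in C.
 */
static size_t InlinedWcslen(const wchar_t *Path)
{
   const wchar_t *p = Path;

   while (*p != L'\0')   /* scan WCHARs for the 0x0000 terminator */
      p++;

   return (size_t)(p - Path);
}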

My initial plan was, assuming that the fix was trivial, to simply perform a small opcode patch on the unpatched version of netapi32.dll in the Server service process on the box in question. In this particular instance, however, there were a number of other changes throughout the patched function that made use of the additional length check. As a result, a small opcode patch wasn’t ideal, as large parts of the function would need to be rewritten to take advantage of the extra length check.

Thus, plan B evolved, wherein the Microsoft-supplied patched version of netapi32.dll would be injected into an already-running Server service process. From there, the plan was to detour buggy_netapi32!NetpwPathCanonicalize to fixed_netapi32!NetpwPathCanonicalize.

As it turns out, netapi32!NetpwPathCanonicalize and all of its subfunctions are stateless with respect to global netapi32 variables (aside from the /GS cookie), which made this approach feasible. If the call tree involved a dependency on netapi32 global state, then simply detouring the entire call tree wouldn't have been a valid option, as the globals in the fixed netapi32.dll would be distinct from the globals in the buggy netapi32.dll.

This approach also makes the assumption that the only fixes made for the patch were in netapi32!NetpwPathCanonicalize and its call tree; as far as I know, this is the case, but this method is (of course) completely unsupported by Microsoft. Furthermore, as x64 binaries are built without hotpatch nop stubs at their prologues, atomically patching in a detour appeared to be out of the question, so this approach has a chance of failing in the (unlikely) scenario where the first few instructions of netapi32!NetpwPathCanonicalize were being executed at the time of the detour.

Nonetheless, the worst case scenario would be that the box went down, in which case I'd be rebooting now instead of later. As the whole point of this exercise was to try and delay rebooting the system in question, I decided that this was an acceptable risk in my scenario, and elected to proceed. For the first step, I needed a program to inject a DLL into the target process (SDbgExt does not support !loaddll on 64-bit targets, sadly). The program that I came up with is certainly quick'n'dirty, as it fudges the thread start routine by using kernel32!LoadLibraryA as the initial start address (which is a close enough analogue to LPTHREAD_START_ROUTINE to work), but it does the trick in this particular case.
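
For illustration, here is a minimal sketch of the injector logic (error handling elided; the LoadLibraryA-as-thread-start fudge is exactly the one described above):

#include <windows.h>
#include <string.h>

BOOL InjectDll(DWORD ProcessId, const char *DllPath)
{
   HANDLE Process, Thread;
   PVOID  Remote;
   SIZE_T Size = strlen(DllPath) + 1;

   Process = OpenProcess(PROCESS_ALL_ACCESS, FALSE, ProcessId);

   if (!Process)
      return FALSE;

   /* Copy the DLL path into the target process... */
   Remote = VirtualAllocEx(Process, NULL, Size, MEM_COMMIT,
      PAGE_READWRITE);
   WriteProcessMemory(Process, Remote, DllPath, Size, NULL);

   /* ...and use kernel32!LoadLibraryA as the thread start routine,
      with the path as its single argument. */
   Thread = CreateRemoteThread(Process, NULL, 0,
      (LPTHREAD_START_ROUTINE)GetProcAddress(
         GetModuleHandleA("kernel32.dll"), "LoadLibraryA"),
      Remote, 0, NULL);

   WaitForSingleObject(Thread, INFINITE);
   CloseHandle(Thread);
   CloseHandle(Process);

   return TRUE;
}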

The next step was to actually load the app into the svchost instance containing the Server service instance. To determine which svchost process this happens to be, one can use “tasklist /svc” from a cmd.exe console, which provides a nice formatted view of which services are hosted in which processes:

C:\WINDOWS\ms08-067-hotpatch>tasklist /svc
[…]
svchost.exe 840 AeLookupSvc, AudioSrv, BITS, Browser,
CryptSvc, dmserver, EventSystem, helpsvc,
HidServ, IAS, lanmanserver,
[…]

That being done, the next step was to inject the DLL into the process. Unfortunately, the default security descriptor on svchost.exe instances doesn’t allow Administrators the access required to inject a thread. One way to solve this problem would have been to write the code to enable the debug privilege in the injector app, but I elected to simply use the age-old trick of using the scheduler service (at.exe) to launch the program in question as LocalSystem (this, naturally, requires that you already be an administrator in order to succeed):

C:\WINDOWS\ms08-067-hotpatch>at 21:32 C:\windows\ms08-067-hotpatch\testapp.exe 840 C:\windows\ms08-067-hotpatch\netapi32.dll
Added a new job with job ID = 1

(21:32 was one minute from the time when I entered that command, minutes being the minimum granularity for the scheduler service.)

Roughly one minute later, the debugger (attached to the appropriate svchost instance) confirmed that the patched DLL was loaded successfully:

ModLoad: 00000000`04ff0000 00000000`05089000 
C:\windows\ms08-067-hotpatch\netapi32.dll

Stepping back for a moment, attaching WinDbg to an svchost instance containing services in the symbol server lookup code path is risky business, as you can easily deadlock the debugger. Proceed with care!

Now that the patched netapi32.dll was loaded, it was time to detour the old netapi32.dll to refer to the new netapi32.dll. Unfortunately, WinDbg doesn't support assembling amd64 instructions very well (64-bit addresses and references to the extended registers don't work properly), so I had to use a separate assembler (HIEW, Hacker's vIEW) and manually patch in the opcode bytes for the detour sequence (mov rax, <absolute address> ; jmp rax):

0:074> eb NETAPI32!NetpwPathCanonicalize 48 C7 C0 40 AD FF 04 FF E0
0:074> u NETAPI32!NetpwPathCanonicalize
NETAPI32!NetpwPathCanonicalize:
000007ff`7737ad30 48c7c040adff04
mov rax,offset netapi32_4ff0000!NetpwPathCanonicalize
000007ff`7737ad37 ffe0            jmp     rax
0:076> bp NETAPI32!NetpwPathCanonicalize
0:076> g

This said and done, all that remained was to set a breakpoint on netapi32!NetpwPathCanonicalize and give the proof of concept exploit a try against my hotpatched system (it survived). Mission accomplished!
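
For reference, the nine bytes patched in above decode as mov rax, imm32 (sign-extended to 64 bits) followed by jmp rax. A sketch of how they might be constructed (FixedNetpwPathCanonicalize is a placeholder for the address of the patched routine, which must reside in the low 2GB of the address space for the sign-extended form to be usable):

#include <windows.h>
#include <string.h>

/* Build the detour stub: 48 C7 C0 xx xx xx xx FF E0. */
void BuildDetour(unsigned char Detour[9],
   ULONG_PTR FixedNetpwPathCanonicalize)
{
   static const unsigned char Template[9] =
      { 0x48, 0xC7, 0xC0, 0, 0, 0, 0, 0xFF, 0xE0 };

   memcpy(Detour, Template, sizeof(Template));

   /* Fill in the 32-bit immediate with the detour target. */
   *(DWORD *)&Detour[3] = (DWORD)FixedNetpwPathCanonicalize;
}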

The obvious disclaimer: This whole procedure is a complete hack, and not recommended for production use, for reasons that should be relatively obvious. Additionally, MS08-067 did not come built as officially “hotpatch-enabled” (i.e. using the Microsoft supported hotpatch mechanism); “hotpatch-enabled” patches do not entail such a sequence of hacks in the deployment process.

(Thanks to hdm for double-checking some assumptions for me.)

Thread Local Storage, part 7: Windows Vista support for __declspec(thread) in demand loaded DLLs

Tuesday, October 30th, 2007

Yesterday, I outlined some of the pitfalls behind the approach that the loader has traditionally taken to implicit TLS, in Windows Server 2003 and earlier releases of the operating system.

With Windows Vista, Microsoft has taken a stab at alleviating some of the issues that make __declspec(thread) unusable for demand loaded DLLs. Although solving the problem may appear simple at first (one would tend to think that all that would need to be done would be to track and process TLS data for new modules as they're loaded), the reality of the situation is unfortunately a fair amount more complicated than that.

At heart is the fact that implicit TLS is really only designed to support operation at process initialization time. This becomes evident when one considers what would need to be done to allocate a TLS slot for a new module: the per-module TLS array is allocated at process initialization time, with only enough space for the modules that were present (and using TLS) at that time. Expanding the array is difficult to do safely, considering the code that the compiler generates for accessing TLS data.

The problem resides in the fact that the compiler reads the address of the current thread’s ThreadLocalStoragePointer and then later on dereferences the returned TLS array with the current module’s TLS index. Because all of this is done without synchronization, it is not in general safe to just switch out the old ThreadLocalStoragePointer with a new array and then release the old array from another thread context, as there is no way to ensure that the thread whose TLS array is being modified was not in the middle of accessing the TLS array.
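
To make the hazard concrete, here is roughly what the compiler-generated access amounts to when written out in C (a conceptual sketch: ThreadLocalStoragePointer resides at offset 0x2C of the TEB on x86 and 0x58 on x64, _tls_index is the loader-assigned module TLS index, and the variable offset is illustrative):

#include <windows.h>

extern ULONG _tls_index;   /* per-module TLS index, set by the loader */

/* Roughly what "TlsVariable = 1;" compiles to for a __declspec(thread)
   variable on x64.  Note that fetching the array pointer and indexing
   into it are separate, unsynchronized steps. */
void SetTlsVariable(size_t VariableOffset)
{
   void **TlsArray = *(void ***)((char *)NtCurrentTeb() + 0x58);

   *(int *)((char *)TlsArray[_tls_index] + VariableOffset) = 1;
}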

A further difficulty presents itself in that there now needs to be a mechanism to proactively go out and place a new TLS module block into each running thread’s TLS array, as there may be multiple threads active when a module is demand-loaded. This is further complicated by the fact that said modifications are required to be performed before DllMain is called for the incoming module, and while the loader lock is still held by the current thread. This implies that, once again, the alterations to the TLS arrays of other threads will need to be performed by the current thread, without the cooperation of additional threads that are active in the process at the time of the DLL load.

These constraints are responsible for the bulk of the complexity of the new loader code in Windows Vista for TLS-related operations. The general concept behind how the new TLS support operates is as follows:

First, a new module is loaded via LdrLoadDll (which is used to implement LoadLibrary and similar Win32 functions). The loader examines the module to determine if it makes use of implicit TLS. If not, then no TLS-specific handling is performed and the typical loaded module processing occurs.

If an incoming module does make use of TLS, however, then LdrpHandleTlsData (an internal helper routine) is called to initialize support for the new module's implicit TLS usage. LdrpHandleTlsData determines whether there is room in the ThreadLocalStoragePointer arrays of currently loaded threads for the new module's TLS slot (with Windows Vista, the array can initially be larger than the total number of modules using TLS at process initialization time, allowing for cheaper expansion of TLS data when a new module using TLS is demand-loaded). Because all running threads will at any given time have the same amount of space in their ThreadLocalStoragePointer, this is easily accomplished by using a global variable to keep track of the array length. This variable is the SizeOfBitMap member of LdrpTlsBitmap, an RTL_BITMAP structure.
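
(For reference, RTL_BITMAP is the simple structure documented in the WDK; the SizeOfBitMap member of LdrpTlsBitmap thus doubles as the current length of every thread's TLS array:)

typedef struct _RTL_BITMAP {
   ULONG  SizeOfBitMap;   // number of bits (here, TLS array slots)
   PULONG Buffer;         // the bits themselves (set = slot in use)
} RTL_BITMAP, *PRTL_BITMAP;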

Depending on whether the existing ThreadLocalStoragePointer arrays are sufficient to contain the new module, LdrpHandleTlsData allocates room for the TLS variable block for the new module and possibly new TLS arrays to store in the TEB of running threads. After the new data is allocated for each thread for the incoming module, a new process information class (ProcessTlsInformation) is utilized with an NtSetInformationProcess call to ask the kernel for help in switching out TLS data for any threads that are currently running in the process. Conceptually, this behavior is similar to ThreadZeroTlsCell, although its implementation is significantly more complicated. This step does not really appear to need to occur in kernel mode, and it introduces significant (arguably unnecessary) complexity, so it is unclear why the designers elected to go this route.

In response to the ProcessTlsInformation request, the kernel enumerates threads in the current process and either swaps out one member of the ThreadLocalStoragePointer array for all threads, or swaps out the entire pointer to the ThreadLocalStoragePointer array itself in the TEB for all threads. The previous values for either the requested TLS index or the entire array pointer are then returned to user mode.

LdrpHandleTlsData then inspects the data that was returned to it by the kernel. Generally, this data represents either a TLS data block for a module that has since been unloaded (which is always safe to immediately free), or it represents an old TLS array for an already running thread. In the latter case, it is not safe to release the memory backing the array, as without the cooperation of the thread in question, there is no way to determine when the thread has released all possible references to the old memory block. Since the code to access the TLS array is hardcoded into every program using implicit TLS by the compiler, for practical purposes there is no particularly elegant way to make this determination.

Because it is not easily possible to determine (prove) when the old TLS array pointer will never again be referenced, the loader enqueues the pointer into a list of heap blocks to be released at thread exit time, when the thread that owns the old TLS array performs a clean exit. Thus, the old TLS array pointer (if the TLS array was expanded) is essentially intentionally leaked until the thread exits. This is a fairly minor memory loss in practice, as the array itself is only an array of pointers. Furthermore, the array is expanded in such a way that most of the time, a new module will take an unused slot in the array instead of requiring the TLS array to be reallocated each time. This sort of intentional leak is, once again, necessary due to the design of implicit TLS not being particularly conducive to supporting demand loaded modules.

The loader lock itself is used for synchronization with respect to switching out TLS pointers in other threads in the current process. While a thread owns the loader lock, it is guaranteed that no other thread will attempt to modify its TLS array (or that of any other thread). Because the old TLS array pointers are kept if the TLS array is reallocated, there is no risk of touching deallocated memory when the swap is made, even though the threads whose TLS pointers are being swapped have no synchronization with respect to reading the TLS arrays in their TEBs.

When a module is unloaded, the TLS slot occupied by the module is released back into the TLS slot pool, but the module’s TLS variable space is not immediately freed until either individual threads for which TLS variable space were allocated exit, or a new module is loaded and happens to claim the outgoing module’s previous TLS slot.

For those interested, I have posted my interpretation of the new implicit TLS support in Vista. This code has not been completely tested, though it is expected to be correct enough for purposes of understanding the details of the TLS implementation. In particular, I have not verified every SEH scope in the ProcessTlsInformation implementation; the SEH scope statements (handlers in particular) are in many cases logical extrapolations of what the expected behavior should be in such cases. As always, all of this should be considered an implementation detail, subject to change without notice in future operating system releases.

(There also appear to be several unfortunate bugs in the Vista implementation of TLS, mostly related to inconsistent states and potential corruption if heap allocations fail at “bad” points in time. These are commented in the above code.)

The handler for the ProcessTlsInformation process set information class does not appear to be a subfunction in reality, but instead a (rather large) case statement in the implementation of NtSetInformationProcess. It is presented as a subfunction for purposes of clarity. For reference, a control flow graph of NtSetInformationProcess is provided, with the basic blocks relevant to the ProcessTlsInformation case statement shaded. I suspect that this information class holds the record for the most convoluted usage of SEH scopes, due to its heavy use of dual input/output parameters.

The information class implementation also takes many unconventional shortcuts that, while technically workable for its use cases, are rather inconsistent with the general way that most other system calls and information classes are architected. The reasoning behind these inconsistencies is not known (perhaps they were a time saver). For example, unlike most other process information classes, the only valid handle that can be used with this information class is NtCurrentProcess(); in other words, the information class handler implementation assumes the caller is the process to be modified.

Thread Local Storage, part 2: Explicit TLS

Tuesday, October 23rd, 2007

Previously, I outlined some of the general design principles behind both flavors of TLS in use on Windows. Anyone can see the design and high level interface to TLS by reading MSDN, though; the interesting parts relate to the implementation itself.

The explicit TLS API is (by far) the simplest of the two classes of TLS in terms of the implementation, as it touches the fewest “moving parts”. As I mentioned last time, there are really just four key functions in the explicit TLS API. The most important two are TlsGetValue and TlsSetValue, which manage the actual setting and retrieving of per-thread pointers.

These two functions are simple enough to annotate entirely. The essential mechanism behind them is that they are basically just “dumb accessors” into an array (two arrays in actuality, TlsSlots and TlsExpansionSlots) in the TEB, which is indexed by the dwTlsIndex argument to return (or set) the desired per-thread variable. The implementation of TlsGetValue on Vista (32-bit) is as follows (TlsSetValue is similar, except that it writes to the arrays instead of reading from them, and has support for demand-allocating the TlsExpansionSlots array; more on that later):

PVOID
WINAPI
TlsGetValue(
	__in DWORD dwTlsIndex
	)
{
   PTEB Teb = NtCurrentTeb(); // fs:[0x18]

   // Reset the last error state.
   Teb->LastErrorValue = 0;

   // If the variable is in the main array, return it.
   if (dwTlsIndex < 64)
      return Teb->TlsSlots[ dwTlsIndex ];

   // Valid slot indexes are 0 through 1087 (64 + 1024 slots).
   if (dwTlsIndex >= 1088)
   {
      BaseSetLastNTError( STATUS_INVALID_PARAMETER );
      return 0;
   }

   // Otherwise it's in the expansion array.
   // If it's not allocated, we default to zero.
   if (!Teb->TlsExpansionSlots)
      return 0;

   // Fetch the value from the expansion array.
   return Teb->TlsExpansionSlots[ dwTlsIndex - 64 ];
}

(The assembler version (annotated) is also available.)

The TlsSlots array in the TEB is a part of every thread, which gives each thread a guaranteed set of 64 thread local storage indexes. Later on, Microsoft decided that 64 was not enough TLS slots to go around and added the TlsExpansionSlots array, for an additional 1024 TLS slots. The TlsExpansionSlots array is demand-allocated in TlsAlloc if the initial set of 64 slots is exceeded.

(This is, by the way, the nature of the seemingly arbitrary 64 and 1088 TLS slot limitations mentioned by MSDN, for those keeping score.)

TlsAlloc and TlsFree are, for all intents and purposes, implemented just as what one would expect. They acquire a lock, search for a free TLS slot (returning the index if one is found), otherwise indicating to the caller that there are no free slots. If the first 64 slots are exhausted and the TlsExpansionSlots array has not been created, then TlsAlloc will allocate and zero space for 1024 more TLS slots (pointer-sized values), and then update the TlsExpansionSlots to refer to the newly allocated storage.

Internally, TlsAlloc and TlsFree utilize the Rtl bitmap package to track usage of individual TLS slots; each bit in a bitmap describes whether a particular TLS slot is free or in use. This allows for reasonably fast (and space efficient) mapping of TLS slot usage for book-keeping purposes.
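
A simplified sketch of the allocation path (locking and the expansion-array handling described above are elided; RtlFindClearBitsAndSet is exported by ntdll and declared in the WDK):

#include <windows.h>

typedef struct _RTL_BITMAP {
   ULONG  SizeOfBitMap;
   PULONG Buffer;
} RTL_BITMAP, *PRTL_BITMAP;

ULONG NTAPI RtlFindClearBitsAndSet(PRTL_BITMAP BitMapHeader,
   ULONG NumberToFind, ULONG HintIndex);

RTL_BITMAP TlsBitmap;   /* one bit per TLS slot; set = slot in use */

DWORD TlsAllocSketch(void)
{
   /* Find one clear bit, set it, and return its index. */
   ULONG Index = RtlFindClearBitsAndSet(&TlsBitmap, 1, 0);

   return (Index != 0xFFFFFFFF) ? Index : TLS_OUT_OF_INDEXES;
}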

If one has been following along so far, the question of what happens when TlsAlloc must create the TlsExpansionSlots array after there is already more than one thread in the current process may have come to mind. This might appear to be a problem at first glance, as TlsAlloc only creates the array for the current thread. One might be tempted to conclude that explicit TLS therefore doesn't work reliably above 64 TLS slots if the extra slots are allocated after the second thread in the process is created; this is in fact not the case, however. There exists some clever sleight of hand performed by TlsGetValue and TlsSetValue, which compensates for the fact that TlsAlloc can only create the TlsExpansionSlots memory block for the current thread.

Specifically, if TlsGetValue is called with an array index within the confines of the TlsExpansionSlots array, but the array has not been allocated for the current thread, then zero is returned. (This is the default value for an uninitialized TLS slot, and is thus legal.) Similarly, if TlsSetValue is called with an array index that falls under the TlsExpansionSlots array, and the array has not yet been created, TlsSetValue allocates the memory block on demand and initializes the requested TLS slot.

There also exists one final twist in TlsFree that is required to support the behavior of releasing a TLS slot while there are multiple threads running. A potential problem exists whereby a thread releases a TLS slot which is then reallocated, while the previous contents of the slot are still present in other threads running in the process. TlsFree alleviates this problem by asking the kernel for help, in the form of the ThreadZeroTlsCell thread information class. When the kernel sees an NtSetInformationThread call for ThreadZeroTlsCell, it enumerates all threads in the process and writes a zero pointer-length value to each running thread's instance of the requested TLS slot, thus flushing the old contents and resetting the slot to the unallocated default state. (It is not strictly necessary for this to have been done in kernel mode, although the designers chose to go this route.)
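
A sketch of the call TlsFree makes (the prototype is not in winternl.h, so it is declared here by hand; ThreadZeroTlsCell is 10 in the unofficial THREADINFOCLASS numbering):

#include <windows.h>

typedef LONG (NTAPI *NT_SET_INFORMATION_THREAD)(
   HANDLE ThreadHandle, ULONG ThreadInformationClass,
   PVOID ThreadInformation, ULONG ThreadInformationLength);

void ZeroTlsCellForAllThreads(DWORD TlsIndex)
{
   NT_SET_INFORMATION_THREAD NtSetInformationThread =
      (NT_SET_INFORMATION_THREAD)GetProcAddress(
         GetModuleHandleW(L"ntdll.dll"), "NtSetInformationThread");

   /* The thread handle must refer to a thread in the current process;
      the kernel zeroes the given slot in every thread of the process. */
   NtSetInformationThread(GetCurrentThread(), 10 /* ThreadZeroTlsCell */,
      &TlsIndex, sizeof(TlsIndex));
}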

When a thread exits normally, if the TlsExpansionSlots pointer has been allocated, it is freed to the process heap. (Of course, if a thread is terminated by TerminateThread, the TlsExpansionSlots array is leaked. This is yet one reason among innumerable others why you should stay away from TerminateThread.)

Next up: Examining implicit TLS support (__declspec(thread) variables).

The optimizer has different traits between the x86 and x64 compilers

Friday, October 19th, 2007

Last time, I described how the x86 compiler sometimes optimizes conditional assignments.

It is worth adding to this sentiment, though, that the x64 compiler is in fact a different beast from the x86 compiler. The x64 instruction set brings new guaranteed-universally-supported instructions, a new calling convention (with new restrictions on how the stack is used in functions), and an increased set of registers. These can have a significant impact on the compiler.

As an example, let's take a look at what the x64 compiler did when compiling (what I can only assume is) the same source file into the code that we saw yesterday. Even within the small section that I posted, there are a number of differences worth pointing out. The approximately equivalent section of code in the x64 version is as follows:

cmp     cl, 30h
mov     ebp, 69696969h
mov     edx, 30303030h
mov     eax, ebp
mov     esi, 2
cmovnz   eax, edx
mov     [rsp+78h+PacketLeader], eax

There are a number of differences here. First, there’s a very significant change that is not readily apparent from the small code snippets that I’ve pasted. Specifically, in the x86 build of this particular module, this code resided in a small helper subfunction that was called by a large exported API. With the x64 build, by contrast, the compiler decided to inline this function into its caller. (This is why this version of the code just stores the value in a local variable instead of storing it through a parameter pointer.)

Secondly, the compiler used cmovnz instead of going to the trouble of the setcc/dec/and route. Because all x64 processors support the cmovcc family of instructions, the compiler has a free hand to always use them for any x64 platform target.
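
Reading straight off the x64 snippet above, the sequence is just the branch-free form of the following (names assumed):

/* eax keeps the value moved from ebp when cl == 0x30 (ZF set);
   otherwise, cmovnz replaces it with edx. */
PacketLeader = (Byte == 0x30) ? 0x69696969 : 0x30303030;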

There are a number of different reasons why the compiler might perform particular optimizations. Although I’m hardly on the Microsoft compiler team, I might be able to hazard a guess as to why, for instance, the compiler might have decided to inline this code chunk instead of leaving it in a helper function as in the x86 build.

On x86, the call to this helper function looks like so:

push    [ebp+KdContext]
lea     eax, [ebp+PacketLeader]
push    eax
push    [ebp+PacketType]
call    _KdCompReceivePacketLeader@12

Following the call instruction, throughout the main function, there are a number of comparisons between the PacketLeader local variable (that was filled in by KdCompReceivePacketLeader) and one of the constants (0x69696969) that we saw PacketLeader being set to in the code fragment. To be exact, there are three occurrences of the following in short order after the call (though there are control flow structures such as loops in between):

cmp     [ebp+PacketLeader], 69696969h

These are invariably followed by a conditional jump to another location.

Now, when we move to x64, things change a bit. One of the most significant differences between x64 and x86 (aside from the larger address space, of course) is the addition of a number of new general purpose registers. These are great for the optimizer, as they allow a number of things to be cached in registers instead of having to be spilled to the stack or, in this case, encoded as instruction operands.

Again, to be clear, I don’t work on the x64 compiler, so this is really just my impression of things based on logical deduction. That being said, it would seem to me that one sort of optimization that you might be able to make easier on x64 in this case would be to replace all the large “cmp” instructions that reference the 0x69696969 constant with a comparison against a value cached in a register. This is desirable because a cmp instruction that compares a value dereferenced based on ebp (the frame pointer, in other words, a local variable) with a 4-byte immediate value (0x69696969) is a whopping 7 bytes long.

Now, 7 bytes might not seem like much, but little things like this add up over time and contribute to additional paging I/O by virtue of the program code being larger. Paging I/O is very slow, so it is advantageous for the compiler to try to reduce code size where possible in the interest of cutting down on the number of code pages.

Because x64 has a large number of extra general purpose registers (compared to x86), it is easier for the compiler to “justify” the “expenditure” of devoting a register to, say, caching a frequently used value for purposes of reducing code size.

In this particular instance, because the 0x69696969 constant is referenced in both the helper function and the main function, one benefit of inlining the code is that it becomes possible to "share" the constant in a cached register across both the reference in the helper function code and all the comparisons in the main function.

This is essentially what the compiler does in the x64 version: 0x69696969 is loaded into ebp and copied into eax via mov eax, ebp; depending on the condition flags when the cmovnz executes, eax either retains that value or is overwritten with 0x30303030.

Later on in the main function, comparisons against the 0x69696969 constant are performed via a check against ebp instead of an immediate 4-byte operand. For example, the long 7-byte cmp instructions on x86 become the following 4-byte instruction on x64:

cmp     [rsp+78h+PacketLeader], ebp
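
The size difference falls straight out of the instruction encodings (the displacement bytes below are illustrative, not the actual ones):

/* cmp dword ptr [ebp+disp8], imm32: opcode 81 /7 + modrm + disp8 + imm32 */
unsigned char CmpLocalImm32[7] = { 0x81, 0x7D, 0xF8,
                                   0x69, 0x69, 0x69, 0x69 };

/* cmp dword ptr [rsp+disp8], ebp: opcode 39 /r + modrm + SIB + disp8 */
unsigned char CmpLocalReg[4]   = { 0x39, 0x6C, 0x24, 0x40 };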

I'm sure this is probably not the only reason why the function was inlined for the x64 build and not the x86 build, but the optimizer is fairly competent, and I'd be surprised if this kind of possible optimization wasn't factored in. Other reasons in favor of inlining on x64 are, for instance, the restrictions that the (required) calling convention places on the sort of custom calling conventions possible on x86, and the fact that any non-leaf function that isn't inlined requires its own unwind metadata entries in the exception directory (which, for small functions, can be a non-trivial amount of overhead compared to the opcode length of the function itself).

Aside from changes about decisions on whether to inline code or not, there are a number of new optimizations that are exclusive to x64. That, however, is a topic for another day.

Compiler tricks in x86 assembly: Ternary operator optimization

Thursday, October 18th, 2007

One relatively common compiler optimization that can be handy to quickly recognize relates to conditional assignment (where a variable is conditionally assigned either one value or an alternate value). This optimization typically happens when the ternary operator in C (“?:”) is used, although it can also be used in code like so:

// var = condition ? value1 : value2;
if (condition)
  var = value1;
else
  var = value2;

The primary optimization that the compiler would try to make here is the elimination of explicit branch instructions.

Although conditional move operations were added to the x86 instruction set around the time of the Pentium II, the Microsoft C compiler still does not use them by default when targeting x86 platforms (in contrast, the x64 compiler uses them extensively). However, there are still some tricks that the compiler has at its disposal to perform such conditional assignments without branch instructions.

This is possible through clever use of the "conditional set" (setcc) family of instructions (e.g. setz), which store either a zero or a one into a register without requiring a branch. Here's an example that I came across recently:

xor     ecx, ecx
cmp     al, 30h
mov     eax, [ebp+PacketLeader]
setnz   cl
dec     ecx
and     ecx, 0C6C6C6C7h
add     ecx, 69696969h
mov     [eax], ecx

Broken up into the individual relevant steps, this code is something along the lines of the following in pseudo-code:

if (al == 0x30)
  ecx = 0;
else
  ecx = 1;

ecx--; // after, ecx is either 0 or 0xFFFFFFFF (-1)
ecx &= 0x30303030 - 0x69696969; // 0xC6C6C6C7
ecx += 0x69696969;

The key trick here is the use of a setcc instruction, followed by a dec instruction and an and instruction. If one takes a minute to look at the code, the meaning of these three instructions in sequence becomes apparent: a setcc followed by a dec sets a register to either 0 or 0xFFFFFFFF (-1) based on a particular condition flag. The register is then ANDed with a constant; depending on whether the register held 0 or -1, this leaves the register set to either zero or the constant (since anything ANDed with zero is zero, while ANDing any value with 0xFFFFFFFF yields the input value). After this sequence, a second constant is added to the current value of the register, yielding the desired result of the operation.

(The initial constant is chosen such that adding the second constant to it results in one of the values of the conditional assignment, where the second constant is the other possible value in the conditional assignment.)
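
The whole transformation can be checked with a few lines of C (constants taken from the snippet above):

#include <assert.h>

/* Branch-free selection, as the compiler generated it. */
unsigned int SelectPacketLeader(unsigned char Byte)
{
   unsigned int Mask = (unsigned int)(Byte != 0x30) - 1; /* setnz; dec */

   return (Mask & (0x30303030u - 0x69696969u)) + 0x69696969u;
}

int main(void)
{
   assert(SelectPacketLeader(0x30) == 0x30303030u);
   assert(SelectPacketLeader(0x31) == 0x69696969u);

   return 0;
}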

Cleaned up a bit, this code might look more like so:

*PacketLeader = (al == 0x30 ? 0x30303030 : 0x69696969);

This sort of trick is also often used where something is conditionally set to either zero or some other value, in which case the "add" step can be omitted and the non-zero conditional assignment value is used directly in the AND step.

A similar construct can be built with the sbb instruction and the carry flag (as opposed to setcc), if sbb is more convenient for the particular case at hand. For example, the sbb approach tends to be preferred by the compiler when setting a value to either zero or -1, as a further optimization on this theme: sbb reg, reg materializes 0 or -1 directly from the carry flag, avoiding the need for the dec.
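
As a sketch, the sbb form maps to C as follows (here the condition "x is zero" is assumed to have been carried into CF by a preceding cmp x, 1):

/* "cmp eax, 1 / sbb eax, eax / and eax, value": sbb reg, reg
   materializes 0 - CF, i.e. 0 or 0xFFFFFFFF, with no setcc or dec. */
unsigned int SelectIfZero(unsigned int x, unsigned int value)
{
   unsigned int Mask = 0u - (x == 0);   /* 0xFFFFFFFF iff x == 0 */

   return Mask & value;                 /* value iff x == 0, else 0 */
}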

Fast kernel debugging for VMware, part 4: Communicating with the VMware VMM

Tuesday, October 9th, 2007

Yesterday, I outlined some of the general principles behind how guest to host communication in VMs works, and why the virtual serial port isn't really all that great of a way to talk to the outside world from a VM. Keeping this information in mind, it should be possible to do much better in a VM, but it is first necessary to develop a way to communicate with the outside world from within a VMware guest.

It turns out that, as previously mentioned, there happen to be a lot of things already built in to VMware that need to escape from the guest in order to notify the host of some special event. Aside from the enhanced (virtualization-aware) hardware drivers that ship with VMware Tools (the VMware virtualization-aware addon package for VMware guests), for example, there are a number of "convenience features" that utilize specialized back-channel communication interfaces to talk to code running in the VMM.

While not publicly documented by VMware, these interfaces have been reverse engineered and pseudo-documented publicly by enterprising third parties. It turns out that VMware has a generalized interface (a "fake" I/O port) that can be accessed to essentially call a predefined function running in the VMM, which performs the requested task and returns to the VM. This "fake" I/O port does not correspond to how other I/O ports work (in particular, additional registers are used). Virtually all (no pun intended) of the VMware Tools "convenience features", from mouse pointer tracking to host to guest time synchronization, use the VMware I/O port to perform their magic.

Because there is already information publicly available regarding the I/O port, and because many of the tasks performed using it are relatively easy to find host-side in terms of the code that runs, the I/O port is an attractive target for a communication mechanism. The mechanisms by which to use it guest-side have been publicly documented enough to be fairly easy to use from a code standpoint. However, there’s still the problem of what happens once the I/O port is triggered, as there isn’t exactly a built-in command that does anything like take data and magically send it to the kernel debugger.
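
For the curious, the publicly reverse engineered calling convention looks something like the following from the guest side (x86 MSVC inline assembly; per the unofficial third-party documentation, the 'VMXh' magic goes in eax, the command number in ecx, a command-specific parameter in ebx, and the result comes back in eax):

#define BACKDOOR_MAGIC 0x564D5868   /* 'VMXh' */
#define BACKDOOR_PORT  0x5658

unsigned long VmBackdoor(unsigned long Command, unsigned long Parameter)
{
   unsigned long Result;

   __asm {
      mov   eax, BACKDOOR_MAGIC
      mov   ecx, Command
      mov   ebx, Parameter
      mov   dx,  BACKDOOR_PORT
      in    eax, dx              ; traps to the VMM, which dispatches
      mov   Result, eax          ; the requested command
   }

   return Result;
}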

For this, as alluded to previously, it is necessary to do a bit of poking around in the VMware VMM in order to locate a handler for an I/O port command that would be feasible to replace for purposes of shuttling data in and out of the VM to the kernel debugger. Although the VMware Tools I/O port interface is likely not designed (VMM-side, anyway) for high performance, high speed data transfers (at least compared to the mechanisms that, say, the virtualization-aware NIC driver might use), it is at the very least orders of magnitude better than the virtual serial port, certainly enough to provide serious performance improvements with respect to kernel debugging, assuming all goes according to plan.

Looking through the list of I/O port commands that have been publicly documented (if somewhat unofficially), there are one or two that could possibly be replaced without any real negative impact on the operation of the VM itself. One of these commands (0x12) is designed to pop up the "Operating System Not Found" dialog. This command is actually used by the VMware BIOS code in the VM if it can't find a bootable OS, and not typically by VMware Tools itself. Since any VM that anyone would possibly be kernel debugging must by definition have a bootable operating system installed, axing the "OS Not Found" dialog is certainly no great loss for that case. As an added bonus, because this I/O port command displays UI and accesses string resources, the handler for it happened to be fairly easy to locate in the VMM code.

In terms of the VMM code, the handler for the OS Not Found dialog command looks something like so:

int __cdecl OSNotFoundHandler()
{
   if (!IsVMInPrivilegedMode()) /* CPL=0 check */
   {
     log error message;
     return -1;
   }

   load string resources;
   display message box;

   return somefunc();
}

Our mission here is really just to patch out the existing code with something that knows how to take data from the guest and move it to the kernel debugger, and vice versa. A naive approach might be to try and access the guest's registers and use them to convey the data words to transfer (it would appear that many of the I/O port handlers do have access to the guest's registers, as many of the I/O port commands modify the register data of the guest), but this approach would incur a large number of VM exits and therefore be suboptimal.

A better approach would be to create some sort of shared memory region in the VM and then simply use the I/O port command as a signal that data is ready to be sent or that the VM is now waiting for data to be received. (The execution of the VM, or at least the current virtual CPU, appears to be suspended while the I/O port handler is running. In the case of the kernel debugger, all but one CPU would be halted while a KdSendPacket or KdReceivePacket call is being made, making the call essentially one that blocks execution of the entire VM until it returns.)

There’s a slight problem with this approach, however. There needs to be a way to communicate the address of the shared memory region from the guest to the modified VMM code, and then the modified VMM code needs to be able to translate the address supplied by the guest to an address in the VMM’s address space host-side. While the VMware VMM most assuredly has some sort of capability to do this, finding it and using it would make the (already somewhat invasive) patches to the VMM even more likely to break across VMware versions, making such an address translation approach undesirable from the perspective of someone doing this without the help of the actual vendor.

There is, however, a more unsophisticated approach that can be taken: The code running in the guest can simply allocate non-paged physical memory, fill it with a known value, and then have the host-side code (in the VMM) simply scan the entire VMM virtual address space for the known value set by the guest in order to locate the shared memory region in host-relative virtual address space. The approach is slow and about the farthest thing from elegant, but it does work (and it only needs to be done once per boot, assuming the VMM doesn’t move pinned physical pages around in its virtual address space). Even if the VMM does occasionally move pages around, it is possible to compensate for this, and assuming such moves are infrequent still achieve acceptable performance.
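
Guest-side, that boils down to something like the following kernel-mode sketch (the signature value and all names here are hypothetical):

#include <ntddk.h>

#define VMKD_SIGNATURE 0x564B4D44UL   /* hypothetical magic value */

PVOID SharedRegion;

NTSTATUS AllocateSharedRegion(SIZE_T Size)
{
   SIZE_T i;

   /* Nonpaged pool, so that the backing physical pages stay resident
      for the host-side scan to find. */
   SharedRegion = ExAllocatePoolWithTag(NonPagedPool, Size, 'DKMV');

   if (SharedRegion == NULL)
      return STATUS_INSUFFICIENT_RESOURCES;

   /* Stamp the block with the known value that the patched VMM code
      scans its virtual address space for. */
   for (i = 0; i < Size / sizeof(ULONG); i += 1)
      ((PULONG)SharedRegion)[i] = VMKD_SIGNATURE;

   return STATUS_SUCCESS;
}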

The astute reader might note that this introduces a slight hole whereby a user mode caller in the VM could spoof the signature used to locate the shared memory block and trick the VMM-side code into talking to it instead of the KD logic running in kernel mode (after creating the spoofed signature, a malicious user mode process would wait for the kernel mode code to try and contact the VMM, and hope that its spoofed region would be located first). This could certainly be solved by tighter integration with the VMM (and could be quite easily eliminated by having the guest code pass an address in a register which the VMM could translate instead of doing a string hunt through virtual address space), but in the interest of maintaining broad compatibility across VMware VMMs, I have not chosen to go this route for the initial release.

As it turns out, spoofing the link with the kernel debugger is not really all that much of a problem here; due to the way VMKD is designed, it is up to the guest-side code to actually act on the data that is moved into the shared memory region, and a non-privileged user mode process would have limited ability to do so. It could certainly attempt to confuse the kernel debugger, however.

After the guest-side virtual address of the shared memory region is established, the guest and the host-side code running in the VMM can now communicate by filling the shared memory region with data. The guest can then send the I/O port command in order to tell the host-side code to send the data in the shared memory region, and/or wait for and copy in data destined from a remote kernel debugger to the code running in the guest.

With this model, the guest is entirely responsible for driving the kernel debugger connection in that the VMM code is not allowed to touch the shared memory region unless it has exclusive access (which is true if and only if the VM is currently waiting on an I/O port call to the patched handler in the VMM). However, as the low-level KD data transmission model is synchronous and not event-driven, this does not pose a problem for our purposes, thus allowing a fairly simple and yet relatively performant mechanism to connect the KD stub in the kernel to the actual kernel debugger.

Now that data can be both received from the guest and sent to the guest by means of the I/O port interface combined with the shared memory region, all that remains is to interface the debugger (DbgEng.dll) with the patched I/O port handler running in the VMM.

It turns out that there are a couple of twists relating to this final step that make it more difficult than one might initially expect, however. Expect to see details on that (and more) in the next installment of the VMKD series…

Reversing the V740, part 4: Implementing a solution

Friday, July 6th, 2007

In the previous post in this series, I described some of the functionality in place in the V740’s abstraction module for the Verizon connection manager app, and the fact that as it was linked to a debug build of the Novatel SDK, reversing relevant portions of it would be (relatively) easy (especially due to the numerous debug prints hinting at function names throughout the module).

As mentioned last time, while examining the WmcV740 module, I came across some functions that appeared as if they might be of use (and one in particular, Diag_Call_End, that, assuming my theory panned out, would instruct the device to enter dormant mode – and from there, potentially reacquire an EVDO link if available).

However, several obstacles remained in the way: First, there were no callers for this function in particular, potentially complicating the process of determining valid input arguments if the purpose of any arguments were not immediately obvious from the function implementation. Second, the function in question wasn’t in the export table of the DLL, so there existed no (clean) way to resolve its address.

The first problem turned out to be a fairly trivial one, as very basic analysis of the function determined that it didn't even take any arguments. It does use some global state; however, that global state is already initialized by the exported functions that initialize the abstraction layer module, meaning that the function itself should be fairly straightforward to use.

From an implementation standpoint, the function looked much like many of the other diagnostics routines shipped with the Novatel SDK. The function is essentially a very thin wrapper around the communications protocol used to talk to the device firmware, and doesn’t really add a lot of “value” on top of that, other than managing the transmission of a request to the firmware and the reception of a response. In pseudocode, the function is roughly laid out as follows:

Diag_Call_End()
{
  DebugPrint(severity, "Diag_Call_End: Begin\n");

  acquire-lock;
  pre-send-serial-port-setup;

  initialize-packet;

  //
  // Set the packet opcode.  All other
  // packet parameters are defaults.
  //
  TxPacket.Cmd = NVTL_CMD::DIAG_CALL_END;

  //
  // Transmit the request to the firmware
  //
  Error = Diag_Send_Tx_Packet(&TxPacket, PACKET_SIZE);

  if (Error)
   handle-error;

  //
  // Receive the response.
  //
  Error = GetResponse(NVTL_CMD::DIAG_CALL_END);

  if (Error)
   handle-error;

  if (Bad response format)
   handle-error;

  //
  // Clean up and return
  //
  post-send-serial-port-cleanup;
  free-memory(response-buf);

  release-lock;

  DebugPrint(severity, "Diag_Call_End: "
   "End: RetVal: %d\n", return-status);
  return success;
}

It’s a good thing the debug prints were there, as there isn’t really anything to go on besides them. All this function does, from the perspective of the code in the DLL, is simply set up a (very simple) request packet, send it to the firmware, receive the response, and return to the caller. This same structure is shared by most of the other Diag_* functions in the module which communicate to the firmware; in general, all those functions do is translate C arguments into the over-the-wire protocol, call the functions to send the packet and wait for a response, and then unpackage the response data back into return data for the C caller (if applicable). The firmware is responsible for doing all the real work behind the requests. Putting it another way, think of the SDK functions embedded in the WMC module as RPC stubs, the driver that creates the virtual serial port as the RPC runtime library, and the firmware on the card as the RPC server (although the whole protocol and data repackaging process is far simpler than RPC).

Now, because most (really, all) of the logic for implementing particular requests resides in the device firmware on the card, the actual implementation is for the most part a “black box” – we can see the interface (and sometimes have examples for how it is called, if a certain SDK function is actually used), but we can’t really see what a particular request will do, other than observe side effects that the client code calling that function (if any) appears to depend upon.

Normally, that’s a pretty unpleasant situation to be in from a reversing standpoint, but the debug prints at least give us a fighting chance here. Thanks to them, as opposed to an array of un-named functions that send different unknown bitpatterns to an opaque firmware interface, at least we know what a particular firmware call wrapper function is ostensibly supposed to do (assuming the name isn’t too cryptic – we’ll probably never know what some things like “nw_nw_dtc_sms_so_get” actually refer to exactly).

Back to the problem at hand, however. After analyzing Diag_Call_End, it’s pretty clear that the function doesn’t take any arguments and simply returns an error code (or success) indicator to the caller. All of the global state depended upon by the function is the “standard stuff” that is shared by anything using the firmware comms interface, including functions that we can observe being called indirectly by the connection manager app, so it’s a good bet that we should be able to just call the function and see what happens.

However, there's still the minor snag relating to the fact that Diag_Call_End isn't exported from WmcV740.dll. There are a couple of different approaches that we could take to try and solve this problem, with varying degrees of complexity, depending on our requirements. For example, in an attempt to provide some level of automatic compatibility with future (or previous) releases, we might implement some kind of code fingerprinting that could be used to scan code in the DLL to look for the start of this particular function. In this instance, however, I decided it wasn't really worth the trouble; for one, WmcV740.dll is fairly well self-contained and doesn't depend on anything other than the driver to set up the virtual serial port (and the device, of course), and from examining debug prints in the DLL, it became clear that it was designed to support multiple firmware revisions (and even multiple devices). Given this, it seemed an acceptable limitation to tie a program to this particular version of WmcV740.dll and trust that it will remain backwards/forwards compatible enough with any device firmware updates I apply (if any). Because the DLL is self-contained, the connection manager software could even conceivably be updated after placing a copy of the DLL in a different location, since it isn't tied into the rest of the connection manager software in any meaningful way.

As a result of these factors, I settled on just hardcoding offsets from the module base address to the start of the function in question that I wanted to call. Ugly, yes, but in this particular instance, it seemed like the most reasonable compromise. Recall that in Win32, the HMODULE value returned by LoadLibrary is really the base address of a given module, making it trivially easy to locate a loaded module's base address in memory. From there, it was just a matter of adding the offsets to the module base to form complete pointer values, casting these to function pointers, and making the call.

After all of that, all that's left is to try the function out. This involves loading the WMC module, calling a standard export (WMC_Startup) to initialize it, and then just making the call to the non-exported Diag_Call_End.
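
In code, the sequence looks something like the sketch below (the RVA is a placeholder for the hardcoded offset into WmcV740.dll 1.0.6.6, and the WMC_Startup prototype shown is assumed):

#include <windows.h>

#define DIAG_CALL_END_RVA 0x12345   /* placeholder, not the real value */

typedef int (__cdecl *WMC_STARTUP_FN)(void);     /* prototype assumed */
typedef int (__cdecl *DIAG_CALL_END_FN)(void);

int ForceDormantMode(void)
{
   HMODULE Wmc = LoadLibraryA("WmcV740.dll");

   if (!Wmc)
      return -1;

   /* An HMODULE is the module base address, so base + RVA gives us a
      pointer to the non-exported routine. */
   WMC_STARTUP_FN   WMC_Startup   =
      (WMC_STARTUP_FN)GetProcAddress(Wmc, "WMC_Startup");
   DIAG_CALL_END_FN Diag_Call_End =
      (DIAG_CALL_END_FN)((ULONG_PTR)Wmc + DIAG_CALL_END_RVA);

   WMC_Startup();            /* initialize the abstraction layer */

   return Diag_Call_End();   /* ask the firmware to go dormant */
}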

As luck would have it, the function call did exactly what I had hoped – it caused the device to enter dormant mode if there was an active data session. The next time link activity occurred, the call was picked back up (and if the call had failed over to 1xRTT and an EVDO link could be re-acquired, the call would be automatically upgraded back to EVDO). Not quite as direct as simply commanding the card to re-scan for EVDO, but it did get the job done, if in a slightly round-about fashion.

From there, all that remained was to add an automated component to this – periodically ask the card whether it was in 1xRTT or EVDO mode, and if the former, push the metaphorical "end call" button every so often to try and coax the card into switching over to EVDO. This information is readily available via the standard WMC abstraction layer (which was fairly well understood at this point), albeit with a caveat: the card appears to not even try to scan for an EVDO link after it has failed over to 1xRTT (or if it does, it doesn't make this fact known to anything on the other end of the firmware comms interface, as far as I could tell). This means it's not easy to distinguish between the device being in 1xRTT mode because there really is no EVDO coverage locally, period, and the device having picked the wrong network up when it re-acquired signal after you went under a bridge (or into an elevator, or whatever) for a moment and temporarily lost signal.

Still, all things considered, the solution is workable (if a major hack in terms of architecture). For those in a similar predicament, I've posted the program that I wrote to periodically try to re-acquire an EVDO link based on the information I arrived at while working on this series. It's a console app that will display basic signal strength statistics over time, and will (as previously mentioned) automatically place the device into dormant mode every so often while you're on 1xRTT, in an attempt to re-acquire EVDO access after a signal loss event.

To use it, you'll need the VC2005 SP1 CRT installed, and you'll also need WmcV740.dll version 1.0.6.6 (exact match required for the dormant mode functionality to operate), which comes with the current version of VZAccess Manager for the V740 for Windows Vista (at the time of this writing, that's 6.1.8). Other versions may work if they include the exact same version of WmcV740.dll. You'll need to place WmcV740.dll in the same directory as wwanutil.exe for it to function, or it'll bail out when it can't load the module.

Also, only one program can talk to the V740's firmware communication port at a time, which means that while you are running wwanutil, you can't run VZAccess (or any other program that tries to talk to the V740's firmware communication port). If you try to start wwanutil while VZAccess is using the card, you'll get error 65; likewise, if you try to start VZAccess while wwanutil is running, VZAccess will complain that something else is using the device. You can still dial the connection manually via Windows DUN, however – the "AT" modem port is unaffected.

Of course, software considerations aside, you'll also need a V740 (otherwise known as a Merlin X720) ExpressCard, with a corresponding service provider plan. (As far as I can tell, the Sprint and Verizon Novatel Rev.A ExpressCards are all rebranded Novatel Merlin X720s and should be functionally identical, but as I am not a Sprint customer, I can't test that.) Theoretically, the WmcV740 module supports other Novatel devices, but I haven't tested that either (I suspect that the protocol used to talk to the firmware is actually a generic Qualcomm chipset diagnostics protocol that may function across other manufacturers – it sure seems to be very similar to the protocol that BitPim uses to talk to many Qualcomm phones, for instance – but the Wmc module will only detect Novatel devices).

Also, given that the program is calling undocumented functions in the device's firmware control interface, I'd recommend against trying it out on every single device you can get your hands on, just to be on the safe side. Although the module is theoretically smart enough to detect whether it's really talking to a Novatel device of a sufficiently high firmware/protocol revision, or something else, I can't help you if you somehow manage to brick your card with it (though I don't see how you'd possibly do that with the program, just covering all the bases…). The usual disclaimers apply: no warranty provided (this program is provided "as-is"), and I can't provide support for your device or add support for (insert X random other device here).

If you hit Alt-2 while the wwanutil console window is up, you’ll get statistics akin to the field test mode (FTM) available in VZAccess Manager, although I can’t guarantee that the FTM option in VZAccess was actually accurate (or tell you how to interpret many of the fields). Since the verbose display is based on the same information as the connection manager GUI, it is probably just as accurate (or inaccurate) as the normal FTM display, though perhaps in a more readable format. Alt-3 displays a log of recent connection events (Alt-1 returns to the main screen), and you can use Ctrl-D at any of the screens to manually force the device into dormant mode (though it may immediately pop back into active mode if there is link activity, just as if you hit “end call” on a tethered handset while the link was still active).

With a workable solution to my original predicament found, this wraps up the V740 series (at least for now…). Hopefully, support for things like periodically auto-reacquiring EVDO will eventually find its way into the stock connection manager software, but until then, this will have to do.

Reversing the V740, part 3: The V740 abstraction layer module

Wednesday, July 4th, 2007

Last time, I described (at a high level) the interface used by the Verizon connection manager to talk to the V740, and how it related to my goal of coercing the V740 to re-upgrade to EVDO after failing over to 1xRTT, without idling the connection long enough for dormant mode to take over.

As we saw previously, it turned out that the V740’s abstraction layer module (WmcV740.dll) was statically linked to a debug version of a Novatel SDK that encapsulated the mechanism used to talk to the V740’s firmware and instruct the device to perform various tasks.

Now, the WmcV740.dll module contains code specific to the V740 (or at least to Novatel devices) that knows how to talk to a V740 connected to the computer. Internally, the way this appears to work is that the Novatel driver creates a virtual serial port (conveniently named Novatel Wireless Merlin CDMA EV-DO Status Port (COMxx)), which the DLL then uses to send data to, and receive data from, the firmware interface. In other words, the virtual serial port is essentially a “backdoor” control channel to the firmware, separate from the dial-up modem aspect of the device, which the driver also presents in the form of a modem device controllable via standard “AT” commands. (The advantage of this approach is that a serial port is typically only accessible by one program at a time; if one is using the standard RAS/dial-up modem support in Windows to dial the connection, that precludes using the modem serial port to perform control functions on the device, as it’s already in use for the data session.)
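As an aside, since the control channel is just a (virtual) serial port, opening it from code is unremarkable. A minimal sketch follows, assuming the status port happened to land on COM5 on a given machine – the real port number varies, and the DLL locates the correct port itself by querying the driver:

#include <windows.h>
#include <cstdio>

int main()
{
    // The Novatel driver exposes the firmware control channel as a
    // virtual serial port; "COM5" is just an example port name here.
    HANDLE port = CreateFileW(L"\\\\.\\COM5",
                              GENERIC_READ | GENERIC_WRITE,
                              0,    // no sharing: one client at a time
                              NULL, OPEN_EXISTING, 0, NULL);

    if (port == INVALID_HANDLE_VALUE)
    {
        // A sharing violation here is why only one program can talk to
        // the firmware comms interface at any given time.
        printf("CreateFile failed: %lu\n", GetLastError());
        return 1;
    }

    // ... exchange firmware protocol frames via WriteFile/ReadFile ...

    CloseHandle(port);
    return 0;
}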

By simply looking at the debug print strings in the binary, it’s possible to learn a fair amount about what sort of capabilities the SDK functions baked into the binary might hold. Most functions contained, at the very least, a debug print at the start and end, like so:

Diag_Get_Time   proc near

push    ebp
mov     ebp, esp

[...]

 ; "Diag_Get_Time: Begin\\n"
push    offset aDiag_get_timeB
push    2               ; int
call    DebugPrint
add     esp, 8
cmp     IsNewFirmware, 0
jnz     short FirmwareOkay1

cmp     FirmwareRevision, 70h
jnb     short FirmwareOkay1

 ; "Diag_Get_Time: End: Old Firmware\\n"
push    offset aDiag_get_timeE
push    2               ; int
call    DebugPrint
add     esp, 8
mov     ax, 14h
jmp     retpoint

[...]

Although most of the routines named in debug prints didn’t appear all that relevant to what I was planning on doing, at least one stood out as worth further investigation:

strings WmcV740.dll | find /i "Begin"
[...]
Diag_ERI_Clear: Begin
Diag_Call_Origination: Begin
Diag_Call_End: Begin
Diag_Read_DMU_Entry: Begin
[...]

(Strings is a small SysInternals utility to locate printable strings within a file.)

In particular, the Diag_Call_End routine looked rather interesting (recall that at least for my handset, pressing the “end call” button on the device while a data connection is active places it into dormant mode). If this routine performed the same function as my handset’s “end call” button, and if “ending” the data call also put the V740 into dormant mode (such that it would re-select which network to use), then it just might be what I was looking for. There were several other functions that looked promising as well, though experimentation with them proved unhelpful in furthering my objective in this particular instance.

At this point, there were essentially two choices: either I could try to re-implement the code necessary to speak the custom binary protocol that WmcV740.dll uses to communicate with the device, or I could try to reuse as much of the already-existing code in the DLL as possible. There are trade-offs to both approaches. Reimplementing the protocol would allow me to bypass any potential shortcomings in the DLL (it actually turns out that the DLL is slightly buggy; the commands to reset power on the device result in corruption of the process heap – ick! – though fortunately, those functions were not required for what I wanted to do). Additionally, assuming I was correct in my theory that the driver’s virtual serial port is just a pass-through to the firmware, a “from-scratch” implementation might be more portable to other platforms for which Novatel doesn’t supply drivers (and might be extendable to do things that the parts of the SDK linked into WmcV740.dll don’t support).

However, reimplementing a client for the firmware control interface would also require a much greater investment of time than simply reusing the existing protocol communications code in the DLL; while not extremely complicated, extra work would be needed to reverse enough of the protocol to reach the point of sending the commands issued by the “Diag_Call_End” function we’re interested in. Furthermore, reimplementing the protocol client from scratch carries more risk than reverse engineering a typical network protocol: instead of talking to another program running on a conventional system, I would be talking to the firmware of a device that I have no way to service. In other words, if I botched my protocol client badly enough to make the firmware do something sufficiently bad as to break the device entirely, I’d have a nice, expensive paperweight on my hands. Without knowing a whole lot more about the firmware and the device itself, it’s hard to predict what it will do when given bad input; it might well be quite robust against such things, but it could just as easily fall over and die in some horribly bad way on malformed input, which makes that a relatively risky route. Not fun, by any means. To top it off, at this point I didn’t even know whether the Diag_Call_End function would pan out at all, so going the whole nine yards just to try it out might mean blowing a lot of effort on another dead end.

Given that, I decided to go the more conservative route of trying to use the existing code in WmcV740.dll, at least initially. (Although I did research the actual protocol a fair bit after the fact, I never ended up writing my own client for it, merely reusing what I could from the WmcV740.dll module.) However, there’s a sticking point here: the DLL doesn’t provide any externally visible way for code outside the module to reach Diag_Call_End. In other words, there are no exported functions that lead to Diag_Call_End. Actually, the situation was even more grim than that; after a bit of analysis in IDA, it became immediately clear that there weren’t any callers of Diag_Call_End present in the module, period! That meant that I wouldn’t have a “working model” to debug at runtime, as I did with the interface that WmcV740.dll exports for use by the Verizon connection manager GUI.

Nonetheless, the problem is not exactly insurmountable, at least not if one is willing to get one’s hands dirty and use some rather “under-handed” tricks – assuming that we can afford to tie ourselves to one particular version of WmcV740.dll.
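To give a flavor of what such a trick looks like (the details are the subject of the next post), the basic idea is to load the DLL and then call the unexported routine directly, via a hard-coded offset recovered from static analysis. The sketch below is illustrative only: the RVA is a placeholder rather than the real offset (it must be taken from IDA for the exact DLL version in use, which is what ties us to one particular build), the calling convention and 16-bit status return are educated guesses from the disassembly, and in reality the DLL’s communications channel has to be brought up through its exported API before a call like this could accomplish anything.

#include <windows.h>
#include <cstdio>

// Placeholder RVA for Diag_Call_End - NOT the real offset; it must be
// recovered from IDA for the exact WmcV740.dll build in use.
static const ULONG_PTR kDiagCallEndRva = 0x00012340;

// Assumed signature: cdecl, 16-bit status return (cf. "mov ax, 14h").
typedef unsigned short (__cdecl *DiagCallEndProc)(void);

int main()
{
    HMODULE mod = LoadLibraryW(L"WmcV740.dll");
    if (!mod)
    {
        printf("Couldn't load WmcV740.dll\n");
        return 1;
    }

    // Diag_Call_End has no export (and no callers!), so GetProcAddress
    // is useless; compute its address from the module base instead.
    DiagCallEndProc DiagCallEnd =
        (DiagCallEndProc)((BYTE*)mod + kDiagCallEndRva);

    unsigned short status = DiagCallEnd();
    printf("Diag_Call_End returned %u\n", (unsigned)status);

    FreeLibrary(mod);
    return 0;
}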

Next time: Zeroing in on Diag_Call_End as a potential solution.

Reversing the V740, part 2: Digging deeper: The connection manager software

Tuesday, July 3rd, 2007

Continuing from the previous article, the first step in determining a possible way to kick the V740 back into EVDO mode is to understand how the Verizon connection manager software interfaces with it. In order to do this, I used WinDbg and IDA for debugging and code analysis, respectively.

The software that Verizon ships presents a uniform user interface regardless of what type of device you might be using. This means that it likely has some sort of generic abstraction layer for communicating with connected devices (the connection manager software supports a wide variety of cards and tethered phones, from many different manufacturers).

Indeed, it turns out that there is an abstraction layer, in the form of a set of DLLs present in the connection manager installation directory, named after the devices they support (e.g. WMCLGE_VX8700.dll, WMC_NOK_6215i.dll, WmcV740.dll). These DLLs implement a high-level interface in the form of a (fairly simple) exported C API, which the connection manager software uses to control the connected device.

The abstraction layer is fairly generic across devices, and as such does not support many advanced device-specific features. However, it does provide support for operations like querying information about the network and the over-the-air protocols in use, link quality (e.g. signal levels), sending and receiving SMS messages, manipulating the phone book (for some devices – though not the V740, as it happens), powering the device on and off, performing over-the-air activation, and a handful of other operations. In turn, each device-specific module translates requests from the standard exported C API into calls to its associated device, using whatever device-specific mechanism that device offers for communicating with the computer. This is typically done by building a sort of adapter from the connection manager’s expected C API to a vendor-specific library or SDK for each device.

As most of the APIs in question take only one or two parameters (and could be easily inspected on the fly by debugging the connection manager), it was fairly trivial to gain a basic working understanding of the device abstraction layer API. The approach I took was to simply set a breakpoint on all of the exported functions (in WinDbg, bm wmcv740!*) and, from there, record which functions were invoked by which parts of the connection manager GUI and inspect in/out parameter values, in conjunction with analysis in IDA. Although there are some fairly tantalizingly named functions (e.g. WMC_SetPower, WMC_SetRadioPower), none of the APIs in question turned out to do quite what I wanted. (The calls to set low power mode / normal mode / on / off for the device cause any active PPP dial-up connection over the device to be reset, ruling them out as useful for my purposes.)
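For what it’s worth, poking at exports like these from outside the connection manager is straightforward. Here’s a minimal sketch of the sort of test harness involved; note that while the WMC_SetRadioPower name is real, the prototype used here is a guess on my part (inferred from debugging), not something from an official header:

#include <windows.h>
#include <cstdio>

// Guessed prototype - the exports take only one or two parameters, but
// no official header is available, so this signature is an assumption.
typedef unsigned short (__cdecl *WmcSetRadioPowerProc)(int powerState);

int main()
{
    HMODULE mod = LoadLibraryW(L"WmcV740.dll");
    if (!mod)
    {
        printf("Couldn't load WmcV740.dll\n");
        return 1;
    }

    // Unlike Diag_Call_End (discussed in the next post), this function
    // really is exported, so GetProcAddress finds it normally.
    WmcSetRadioPowerProc WMC_SetRadioPower =
        (WmcSetRadioPowerProc)GetProcAddress(mod, "WMC_SetRadioPower");

    // Illustrative only: per the above, the power calls will reset any
    // active PPP session over the device.
    if (WMC_SetRadioPower)
        printf("WMC_SetRadioPower -> %u\n",
               (unsigned)WMC_SetRadioPower(1));

    FreeLibrary(mod);
    return 0;
}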

The fact that none of the APIs did what I set out to look for (though I did determine how to do some interesting things, like querying signal strengths and what sort of coverage is present) is not all that surprising, as there doesn’t appear to be any support in the connection manager app for forcing a device to change protocols on the fly (or for forcing a dormant mode transition). Because the abstraction layer essentially implements only the minimum functionality needed for the connection manager software to operate, device-specific functionality that the connection manager doesn’t use isn’t exposed by the standard exported C API. All of this translates into no dice on using the exported APIs to force the card to upgrade back to EVDO on the fly.

While this might initially seem like a pretty fatal roadblock (at least as far as using the connection manager and its libraries to perform the task at hand), it turned out that there was actually something of value in the V740 module after all, even if it didn’t directly expose any useful functionality of this sort through the exported interface. In a stroke of luck, it just so happened that the V740 module was linked to what appeared to be a debug version of the Novatel SDK. In particular, two important points arose from this being the case:

First, the debug version of the Novatel SDK has handy debug prints all over the place, many naming the function they are called from, providing a fair amount of insight into what the internal functions in the module are named. In effect, the debug prints were nearly as good as having public symbols for the Novatel SDK portion of the module, as most non-trivial functions in the SDK appear to contain at least one debug print, and virtually all the debug prints name the function they were called from.

Second, the SDK itself and the abstraction layer module appear to have been built without a certain linker optimization (/OPT:REF) that removes unreferenced code and data from the final binary. (Disabling this optimization is the default for debug builds.) This meant that a large part of the Novatel SDK (debug prints included) was present in the V740 abstraction layer module, even though most of it went unreferenced. As a result, there existed the possibility that something of use might still be there inside the V740 module, even if it wasn’t directly exposed.
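To illustrate the effect, consider a module built from something like the following sketch; with function-level linking (/Gy) plus /OPT:REF, the linker would discard UnusedSdkHelper (debug strings and all) from the final image, whereas a default debug build (/OPT:NOREF) keeps it around – which is precisely why so much of the unreferenced SDK survived inside WmcV740.dll. (The function name here is made up for illustration.)

#include <cstdio>

// Never called from anywhere: stripped by /OPT:REF (with /Gy), but
// retained - code, debug prints, and all - in a default debug build.
void UnusedSdkHelper()
{
    printf("UnusedSdkHelper: Begin\n");
    printf("UnusedSdkHelper: End\n");
}

int main()
{
    printf("main: nothing references UnusedSdkHelper\n");
    return 0;
}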

At this point, I decided to continue down the road of examining the abstraction layer module, in the hopes that some portion of the baked-in Novatel SDK might be of use.

Next time: Taking a closer look at the V740 abstraction layer module and the Novatel SDK embedded into it.

Reversing the V740, part 1: Rationale

Monday, July 2nd, 2007

Recently, I got a V740 EVDO Rev.A ExpressCard for mobile Internet access. The thing has actually worked fairly well, but it has a couple of shortcomings that prove fairly annoying in practice.

Aside from blinky lights gadget syndrome, the other major deficiency I’ve encountered with the V740 (really a rebranded Novatel Merlin X720) so far is a problem with the 1xRTT<->EVDO handoff logic.

(There’s a little bit of important background information about 1xRTT/EVDO networks here that is relevant for understanding the rest of the series, so bear with me for a minute. Most of this information is from Wikipedia, and I’ve tried to include links to relevant parts along the way.)

You see, the way the card works is that it supports both 1xRTT-based networks (which are fairly common as long as you’re not completely in the middle of nowhere), for roughly dialup modem speeds, and 1xEVDO-based networks, for low(er) latency and high(er) bandwidth. EVDO is still limited to populous areas – e.g. medium to large cities and their surrounding areas – and isn’t as ubiquitous as 1xRTT coverage is, at least in the US.

Because EVDO is hardly universal in the US, most CDMA data devices supporting EVDO can actually use either 1xRTT or EVDO, depending on the local coverage. Beyond that, most such devices also support (mostly seamless) handoff between EVDO and 1xRTT, such that when you move out of an EVDO coverage area into a 1xRTT-only coverage area, the data session is transferred without any interruption other than a momentary blip in latency as the air interface is changed from EVDO to 1xRTT. This can happen even while data is actively being transferred on the link, without dropping TCP connections or anything as intrusive as that, which is very nice; aside from the increase in latency and decrease in bandwidth as you move to 1xRTT, everything otherwise continues to magically “just work”.

However, there is a minor problem with this logic: Sometimes, if you happen to pass through an area with spotty cell coverage (bridge, elevator, whatnot), you’ll lose one or both of the EVDO or 1xRTT signals temporarily. Depending on whether the device reacquires the EVDO signal fast enough, this sometimes results in the call being handed over to 1xRTT even though you’re really still in an EVDO coverage area. The result is that all of a sudden, everything is high latency and low bandwidth, as the device has decided to use the lower quality network. Unfortunately, there doesn’t seem to be support in the V740 to automatically “fail upwards” from 1xRTT back to EVDO if the device re-enters an area with EVDO signal. This means that your data session is now stuck in “slow mode” due to a minor coverage blip – not so good.

The typical solution people take to this problem is to instruct their device to disable 1xRTT support and use only EVDO, which works fine if you really are only ever in an EVDO coverage area. However, I didn’t want to take this approach, as it means manually futzing around with the card’s configuration when you travel. Worse, it makes you less robust against situations where there really is better connectivity to the 1xRTT network than the EVDO network, even if you’re in an EVDO coverage region.

It turns out that there is actually a way to cause the device to re-select the best network after it has failed over to 1xRTT (other than re-dialing the connection), but it is unfortunately not all that seamless: When the device enters “dormant mode” and then transitions back to an active state, it will rescan available networks and pick the best quality network with which to re-engage the link. (Dormant mode is a mode where the device essentially relinquishes its over-the-air slot, freeing it up for another caller, until either end detects activity on the link. Then, the link is seamlessly re-established, again without TCP connections or the like being dropped. Having the device let go of its over-the-air slot is good for network operators, as it allows the system to essentially “borrow back” the capacity used for an otherwise idle session until the session needs it again.)

Normally, dormant mode is only activated when both ends of the link have been idle for some pre-set period of time, usually several seconds. This means that if your connection is fairly idle, you’ll eventually get switched back over to EVDO; if your connection is active, however, it will not enter dormant mode and will thus continue to run on 1xRTT.

Now, if you’re using a conventional handset in tethered mode (Bluetooth or via USB cable), you can often forcibly enter dormant mode on the device with the “end call” button, and as soon as either end of the link tries to send data, the device will pick up the call on the best network, essentially offering a manual way to force the device to consider switching back to EVDO. (At least, that’s how it works on my LG VX8100 – your mileage may vary based on your device, of course.) While that’s not ideal (it’s a completely manual process to hit the end call button, of course), it does get the job done.

With an ExpressCard, however, the problem is a bit more severe, simply because there is no “end call” button. (Sure, you can disconnect the session, but that kind of defeats the whole point – the goal is to re-establish the high speed network connection without dropping your active TCP connections). Unfortunately, as an end user here, you’re pretty much out of luck without the device vendor (or the network operator) providing a tool to coax the device into reconsidering which data network to use (and as far as I know, neither Verizon nor Novatel provide such a utility).

Not satisfied with this, I decided to take matters into my own hands, so to speak, and see if there might be a way to programmatically tell the card to switch back to EVDO, or at least to enter dormant mode on demand without idling all data traffic, so that it would automatically select EVDO when resuming from dormant mode. The desired end result would be a program that I could either run manually or leave running continually while a session is active, to try to re-establish the EVDO link where appropriate. While Novatel does appear to make some mention of an SDK for their devices on their website, it doesn’t appear to be readily available to individual users, which unfortunately rules it out. That leaves reverse engineering the way the V740 is controlled from the computer, in the hopes of finding something that achieves the desired results.

Fortunately, reversing things is something that I do happen to have a little bit of knowledge of…

Next time: Popping the hood on Verizon’s connection manager app (“VZAccess Manager”), and how it talks to the V740 ExpressCard.