AVEKSHA: A Hardware-Software Approach for Non-intrusive Tracing

AVEKSHA: A Hardware-Software Approach for Non-intrusive

Tracing and Profiling of Wireless Embedded Systems∗

Matthew Tancreti

Purdue University

West Lafayette, IN, USA

[email protected]

Mohammad Sajjad Hossain

Purdue University


[email protected]

Saurabh Bagchi

Purdue University


[email protected]

Vijay Raghunathan

Purdue University


[email protected]

AbstractIt is important to get an idea of the events occurring in

an embedded wireless node when it is deployed in the field,away from the convenience of an interactive debugger. Suchvisibility can be useful for post-deployment testing, replay-based debugging, and for performance and energy profilingof various software components. Prior software-based solu-tions to address this problem have incurred high executionoverhead and intrusiveness. The intrusiveness changes theintrinsic timing behavior of the application, thereby reduc-ing the fidelity of the collected profile. Prior hardware-basedsolutions have involved the use of dedicated ASICs or othertightly coupled changes to the embedded node’s processor,which significantly limits their applicability.

In this paper, we present AVEKSHA, a hardware-softwareapproach for achieving the above goals in a non-intrusivemanner. Our approach is based on the key insight that mostembedded processors have an on-chip debug module (whichhas traditionally been used for interactive debugging) thatprovides significant visibility into the internal state of theprocessor. We design a debug board that interfaces withthe on-chip debug module of an embedded node’s proces-sor through the JTAG port and provides three modes ofevent logging and tracing: breakpoint, watchpoint, and pro-gram counter polling. Using expressive triggers that the on-chip debug module supports, AVEKSHA can watch for, andrecord, a variety of programmable events of interest. A keyfeature of AVEKSHA is that the target processor does not

∗This work was supported in part by NSF grants CNS-0953468, ECCS-0925851, and CNS-0716271. The views ex-pressed represent those of the authors and do not necessarily reflectthe views of the sponsoring agency. We thank the anonymous ref-erees and our shepherd, Kay Romer, for their insightful comments,suggestions, and feedback.

Permission to make digital or hard copies of all or part of this work for personal orclassroom use is granted without fee provided that copies are not made or distributedfor profit or commercial advantage and that copies bear this notice and the full citationon the first page. To copy otherwise, to republish, to post on servers or to redistributeto lists, requires prior specific permission and/or a fee.SenSys’11, November 1–4, 2011, Seattle, WA, USA.Copyright 2011 ACM 978-1-4503-0718-5/11/11 ...$10.00

have to be stopped during event logging (in the last two of thethree modes), subject to a limit on the rate at which loggedevents occur. AVEKSHA also performs power monitoring ofthe embedded wireless node and, importantly, enables powerconsumption data to be correlated to events of interest.

AVEKSHA is an operating system-agnostic solution. Wedemonstrate its functionality and performance using threeapplications running on Telos motes; two in TinyOS and onein Contiki. We show that AVEKSHA can trace tasks and othergeneric events at the function and task-level granularity. Wealso describe how we used AVEKSHA to find a subtle bug inthe TinyOS low power listening protocol.Categories and Subject Descriptors

D.2.5 [Software Engineering]: Testing and Debug-ging—Debugging Aids, TracingGeneral Terms

Design, MeasurementKeywords

Wireless Sensor Network, Debugging, Tracing, JTAG1 Introduction

It is often important to get an idea of the events occur-ring in an embedded wireless node when it is deployed inthe field in a remote location, away from the convenience ofan interactive debugger. Such visibility can be useful for var-ious purposes — for debugging any problem a posteriori inthe lab, by recreating the exact sequence of events that thenode experienced in the deployment (this approach is called“record and replay-based debugging”) [1, 2]; for profilingthe operation of a node for the performance of its varioussoftware components and the energy consumed by differenthardware and software components on the node [3, 4, 5]. Asan example of the latter use case, a system owner may beinterested in figuring out which software component is beinginvoked most often and which software component is con-suming most energy per invocation. It is often not possibleto do these determinations in a lab setting because the eventsthat the node experiences in the deployment cannot be recre-ated in the lab and the events (and even their sequence) canhave a bearing on these questions.

We would like to have visibility at a fine granularity - bothspatially and temporally. Spatially fine visibility implies thatit should be possible to trace individual events of interest asopposed to only bursts of events (clearly, tracing every event

is likely to be prohibitive) and it should be possible to traceperformance and energy at fine code regions, such as a func-tion or a task (using TinyOS terminology). This is desirablebecause the fine region of code can then be debugged if it isdetermined through performance profiling that this region iscausing a performance bottleneck, through energy profilingthat it is consuming unexpectedly large amounts of energy,or through record and replay that it is the source of a bug.Temporally fine visibility implies that it should be possibleto do the tracing with a high sampling frequency. Clearly, thetwo dimensions are not independent. In order to trace smallregions of code in a loop, it is necessary to be able to trace ata fine temporal granularity.

While the problem motivation laid out above has beenclear to researchers for quite some time [6], it has provedvery difficult to provide a solution for low-cost embeddedwireless nodes that can operate at a large deployed scale. Thefirst line of attack has been to provide pure software solutions[7, 1, 2]. Such solutions have perturbed the application toomuch to be useful for many of the use cases indicated above.For one, they change the timing behavior enough that somebugs get suppressed. Else, they cause such a large slowdownin the application execution that it is not possible to employthem in a deployed setting. To get around this problem, arecent software solution [2] has focused on a specific kind oftracing (control flow tracing) and intelligent static analysisand runtime trace collection, compression and storage. Thus,it addresses one of the above usage scenarios. The secondline of research has developed hardware solutions for subsetsof the usage scenarios laid out above. For example, [5] hasdeveloped a dedicated integrated circuit, implemented usingan FPGA, that is tightly integrated with the host processorand its peripherals and can measure energy drawn accuratelyat millisecond resolution. Quanto [4] is a solution that de-emphasizes sophisticated hardware design. Instead, it mea-sures energy at the node level, uses indication from devicedrivers about changes in power state, and performs causalitytracking to pin down energy usage due to individual activi-ties. Thus, Quanto is a hardware-software solution, and likeall prior solutions that have a software part, is OS-specific(in this case, TinyOS).

A high-end hardware solution for tracing the executionon an embedded processor is provided by solutions such asGreen Hills Software’s SuperTrace probe and TimeMachinetools [8]. These solutions can collect fine-grained trace datafrom nearly all 32-bit and 64-bit processors, even those with-out integrated trace hardware. Unfortunately, such solutionsare very expensive in dollar terms (e.g., the SuperTrace probeand TimeMachine tools together cost almost $15,000) andare not available for the low-end embedded processors thatare commonly used in embedded wireless nodes.

In summary, our problem statement is the following: Howto perform non-intrusive tracing of execution at a high spa-tial and temporal granularity suitable for an embedded wire-less node, i.e., in a low-cost manner and one that can be de-ployed at a large scale?

In this paper, we present AVEKSHA, a system that

achieves this goal1. AVEKSHA is based on an insight thatmost processors, including low-cost embedded processors,offer visibility into their internal workings through an On-Chip Debug Module (OCDM), whose signals are exposedthrough a standard JTAG interface. This interface has beenused by embedded system engineers primarily for interac-tive debugging, such as single stepping, showing values ofregisters, etc. We show how this visibility, together with thefact that most OCDMs provide a general-purpose method ofsetting triggers, can be leveraged in AVEKSHA to performautomated tracing in a deployed setting.

We develop a debug board formed of standardized com-ponents – a microcontroller unit (MCU), which in ourcase happens to be the same as the application processor,MSP430F1611 from Texas Instruments, and an Actel FPGA,both of which interact with the OCDM on the applicationprocessor over the JTAG interface. We refer to our debugboard as the Telos Debug Board (TDB) because it is intendedto be used with the Telos wireless sensor node (however,our solution is not restricted to the Telos and can easily beadapted to other embedded platforms based on the MSP430microcontroller, and with some effort to other embeddedplatforms). The MSP430 OCDM (also referred to by the mi-crocontroller datasheets as the Enhanced Emulation Moduleor EEM) allows AVEKSHA unprecedented visibility into thestate of the application processor. Further, the OCDM has asmall circular buffer where events of interest can be storedand subsequently drained by the FPGA on the TDB. Thetriggering mechanism of the OCDM is very flexible and istherefore attractive for AVEKSHA. For example, the OCDMcan be triggered to indicate when the application processorhas accessed a certain memory region or a certain peripheraldevice, such as a sensor. We find that the triggering mecha-nism can be combined with thoughtful design to trace all theevents of interest for our three usage scenarios – performanceprofiling, energy profiling, and record-and-replay.

One challenge that we face, and resolve partially, is theneed to do real-time tracing, i.e., without interrupting the ap-plication processor. AVEKSHA is able to achieve this whenthe rate of events that it has to trace does not exceed somebound, which depends on the mode of tracing it uses. AVEK-SHA operates in one of three modes: breakpoint, watchpoint,and program counter (PC) polling. Breakpoint is a baselineand we use it for demonstrating some functionality of theTDB. It is intrusive and, therefore, does not meet our solu-tion requirements. The watchpoint mode has AVEKSHA settriggers, where each trigger unambiguously maps to an eventof interest (such as when a sensor is read). When a triggerfires, the application processor is not stopped, but the state isdumped to a buffer on the OCDM, which is then emptied outby AVEKSHA. This is a rate-limited operation and if eventsof interest happen with a high enough frequency, the bufferoverflows and AVEKSHA misses some events of interest. Inthe PC polling mode, the TDB tracks the program countervalues of the application processor without interrupting it.Then, it processes the PC values to determine events of inter-est, such as when control flow has entered a particular func-

1AVEKSHA is a Sanskrit word that means “to monitor”.

!"#$% &'(% )*% +,$#%,-./&012%)31245672%

,2809%:012%

(;91426:%(<.%=10%"'%0;>0368?%

-0@391426:%(<.%=10%:012%0;>0368?%

,-.%

Figure 1. The Telos Debug Board (TDB) with a Telos RevB mote underneath, and the underside of the TDB (right).

tion. These three modes reveal different tradeoffs in termsof intrusiveness, the flexibility in defining which events tocollect, and the rate at which collection can be done.

We make the following claims to novelty and practicalfeasibility from AVEKSHA:

1. We present the first technique for non-intrusive tracingof a wide variety of events, including arbitrary user-defined events, in embedded wireless nodes. We moti-vate the events of interest from three well-accepted us-age scenarios.

2. Our tracing technique is agnostic to the operating sys-tem, compiler infrastructure, or language in which theapplication is implemented.

3. Our hardware is built using off-the-shelf componentsand requires little effort in integrating with the appli-cation board, which is modified only very slightly forenabling the tracing.

4. Our solution is suitable for deployment at a large scalebecause it is low cost, can operate on battery power, andextracts program information directly from the applica-tion processor.

Our solution also has some limitations. We provide a de-tailed discussion of these, along with thoughts on how tomitigate them, in Section 5. In brief, the TDB is a relativeenergy hog itself, drawing about the same power as the ap-plication processor board. Its ability to keep pace with eventsis exceeded if a burst of 8 events happens within a windowsmaller than 976 clock cycles (in the watchpoint mode) orevents happen more frequently than 7 clock cycles (in thePC polling mode).

The roadmap for the rest of the paper is as follows. InSection 2, we describe the hardware of the TDB and thefirmware that goes on it. In Section 3, we show how the TDBcan be used for profiling. In Section 4, we present the exper-imental setup and results with two TinyOS and one Contikiapplication. In Section 5, we discuss feasible extensions ofAVEKSHA. In Section 6, we review related work and Section7 concludes the paper.2 TDB Hardware and Firmware

In this section, we present the design of the Telos DebugBoard (TDB), which provides execution tracing and energymonitoring of the Telos Rev B mote. An MCU and an FPGAprovide the programmability of the TDB.Terminology: We lay out some terminology that we will

-+

-+

-+

-+

Rsense3.74

3.3V 3.3V5V

5V

3.3V

3.3V

ADCx10

ADCx105

100

0.1u

100

0.1uMote DVCC

10

105

Sense Amplify Filter Protect

Figure 2. Simplified schematic of the energy monitoringcircuit of the TDB.

Rsense3.74

3.3V

Vin Vout3.3V LDO

5V from USB

DVCC

Mote TDBnon-USB components

removedbattery

+-

Rload

USB components

Figure 3. Simplified schematic of powering the motethrough the TDB. The mote’s battery should not be con-nected.

use through the rest of the paper. We wish to monitor theexecution of the application processor that is part of an ap-plication processor board. The application processor board,which we sometimes also refer to as the mote, has variousperipherals such as sensors and the JTAG interface in addi-tion to the application processor. We refer to the hardwareboard that is a part of our solution as the Telos Debug Board,while the entire hardware-firmware that forms our solutionis called AVEKSHA.

Figure 1 shows a photograph of the TDB with a Telosmote attached underneath. The MCU, FPGA, and multi-ple USB ports on the TDB are highlighted. The figure alsoshows the JTAG and IO connections between the TDB andthe mote. The TDB is designed so that it can be deployed inthe field, connected to a mote. In this mode of operation, abattery powers the TDB, which in turn provides power to themote. As a secondary mode of operation, the TDB can alsostream logged events directly to a USB host such as a laptop.This is useful for in-lab debugging.2.1 Energy Monitoring

Energy is a key concern for sensor networks, becausemotes must operate unattended on battery power for longperiods of time. When optimizing an application to reduceenergy consumption, it can be useful to observe how muchenergy is consumed in different states. The TDB can mea-sure and log the power consumed by the mote, which canthen be correlated to different operational states of the mote.

The standard method for measuring energy consumptionis by monitoring the voltage over a sense resistor. The senseresistor (Rsense) is placed in series with the load being mea-sured. The voltage (V ) across Rsense is sampled and the load’scurrent draw is calculated by I =V/Rsense. The supply volt-age (Vsupply) can then be used to find the power being drawnby the load with P = I ∗Vsupply. The power samples can thenbe integrated over time to determine energy consumption.

The challenges presented in monitoring energy in sensor

networks is the wide dynamic range of power draw of a moteand the rapid changes in power draw. For example, a motemay draw only tens of µA in sleep mode and as much as30mA when fully active. It may not be sufficient to ignorethe small power draw when the mote is in low power mode,because typical sensor network applications spend long pe-riods of time in the low power state while only waking forshort periods of time. Further, the change in the current drawwhen the mote transitions from one state to another is rapid.

To meet these challenges, we use two instrumentation am-plifiers to amplify the voltage across Rsense by a gain of 10and 105, as shown in Figure 2. Because Rsense is placed onthe high side, the amplifiers need to be supplied with a volt-age larger than 3.3V, in this case 5V. The output of these am-plifiers is fed into an RC low-pass filter to avoid interferencefrom high frequency components. The cut off frequency isabout 16kHz. This value is justified in the design of theSPOT energy meter, based on an observed significant dropin energy content above this frequency [3]. Another pair ofamplifiers with unity gain is used to protect the ADCs whichcannot tolerate more than 3.3V. Two ADC channels of theMCU on the TDB sample the x10 and x105 lines at 20KHz.The ADCs are 12-bit and sample against a reference volt-age of 2.5V. This gives ADCx10 a resolution of 61µV acrossRsense which is equivalent to 16.3µA of current draw, and amaximum reading of 250mV or 66.8mA, which is more thanthe mote’s maximum current draw of 30mA. The ADCx105gives a resolution of 5.81µV across Rsense which is equiva-lent to 1.55µA of current draw, and a maximum reading of23.8mV or 6.37mA.

As shown in Figure 3, Rsense is placed between the TDB’s3.3V supply and the mote’s DVCC line, while the groundlines of the mote and the TDB are shared. This configurationis known as high side sensing. One advantage of high sidesensing is that the ground plane of the mote and the TDB areshared. Another advantage is that a Zener diode built intothe Telos achieves isolation between the USB componentsand the non-USB components, and thus allows us to cap-ture the current draw of only the non-USB components, evenwhen the mote is connected to a USB host. The diode has aforward bias of about 360mV, meaning that as long as thevoltage drop across the sense resistor remains below 360mV,all current drawn through Rload will be from the TDB. Rsenseis chosen sufficiently low such that this will happen even atmaximum power draw by the mote. The maximum currentof the mote is 30mA which would result in a voltage dropacross the sense resistor of 112mV.

To account for amplifier offset we use a switch betweenRsense and DVCC that allows 3.3V to be temporarily placedat both ends of Rsense in a manner similar to [3]. The MCUhas an ADC buffer of 16 samples, that are filled by DMA toreduce overhead. The x10 and x105 amplified signals aresampled alternately at a rate of 40ksps, to achieve an ef-fective sampling rate of 20ksps. When the ADC buffer isfull, an interrupt service routine sums up the samples in thebuffer. It is desirable to use the ADCx105 reading due to itsgreater current resolution, unless there has been an overflowin its reading. The ADCx10 value is used to determine if theADCx105 has had an overflow.

0 1 2 3 4 5 6 7BIT

TDO

TDI

TMS

TCK

Figure 4. Timing example for shift IR. The byte 0xCC isshifted in on TDI while 0x00 is shifted out on TDO.

Using the MCU’s ADC reduces system cost and complex-ity. However, it does have the disadvantage of introducinga delay between an event occurring on the application pro-cessor, and the the MCU recording the energy. The aver-age delay is 234µs. 30µs is taken by the FPGA to read theevent from JTAG, while the rest is mostly due to the timerequired for the MCU to sum the samples of the 16 entryADC buffer. An alternative design would have the FPGApoll a separate ADC. This would reduce the delay to belowthe 50µs (20KHz) sample period of the ADC.2.2 JTAG Background

The application processor contains an on-chip debugmodule. This module can be used to emulate the processor(directly control the processor operations) and it can executebreakpoints and watchpoints when certain conditions of thedata and address buses are met. A breakpoint halts executionof the processor, while a watchpoint records the contents ofthe data and address bus into an 8 entry circular buffer. TheOCDM is implemented as a state machine that is controlledvia the standard JTAG protocol.

JTAG uses four lines: data output to host (TDO), data in-put to target (TDI), mode select (TMS), and clock (TCK).JTAG shifts frames of data into and out of the OCDM andthat changes the state of the OCDM. There are two basicshift modes: an 8-bit instruction register (IR) shift and an n-bit data register (DR) shift. The TMS line selects between IRor DR at the start of a shift based on the number of TCK ris-ing edges for which it remains high. For example, in Figure4, TMS remains high for two rising edges which selects theIR mode, while one rising edge would select the DR mode.The number of bits shifted in an n-bit DR shift is determinedby TMS being high a second time during the shift of the lastbit. Bits are shifted from the mote to the TDB on the TDOline and from the TDB to the mote on the TDI line. Althoughthe JTAG protocol is standard, the sequence of instructionsthat must be shifted into the OCDM on the MSP430 is pro-prietary. We have reverse engineered these control sequencesand used them in AVEKSHA to determine what command se-quences must be sent to the OCDM for the application to en-ter a breakpoint, to set a watchpoint, or to enable PC polling.2.3 Hardware Architecture

As shown in Figure 5, the TDB consists of a USB hub,a USB to UART adapter, an MCU, and an FPGA. The USBcomponents are primarily for use in a lab environment andprovide reprogramming, control, and streaming of log data.The USB hub has 1 upstream port and 3 downstream ports.The upstream port is used to access the debug board from aPC. One of the downstream ports is permanently connectedto a USB to UART adapter that provides reprogramming and

USB Hub USB/UARTAdapter MCU

UARTUSB

UpstreamUSB

MoteUSB

MoteExpansion Interface

FPGA

16

MoteJTAG

control

data4

Figure 5. Hardware architecture of the TDB.

PC poll Functionlookup Filter Output

BufferPC

addrfuncptr

funcptr

16 16 16

MoteJTAG

MCU

Figure 6. FPGA pipeline in PC polling mode.

data transfer to and from the MCU. The second downstreamport is available for connection to the USB port on the mote.This is useful in a lab or testbed deployment where access tothe mote’s USB port is desired. The final downstream portis available for future use, and we envision it being used intestbed deployments to daisy chain several TDBs together.

The Actel IGLOO nano FPGA interfaces with the mote’sJTAG and expansion interfaces. The expansion interfaceof the mote provides access to the some of the its UART,I2C, ADC, and GPIO peripherals. It was necessary to usean FPGA to control the mote’s JTAG to be able to poll ata sufficient frequency to keep up with the events we wantto observe. For some operations (such as PC polling) wehave found the need to drive the clock line of the JTAG upto 24MHz. A software implementation would require a pro-cessor that operates at several times that speed. Additionally,there is the problem of processing the collected JTAG data todetermine what should be logged. The FPGA allows pipelin-ing of JTAG control and data processing, so that the pollingloop never waits for data processing.

The MCU performs tasks that are less time critical andbetter suited to software. For example, initialization of themote for debugging, reading the contents of the mote’s pro-gram memory, and disassembly of the mote’s program codeare performed by the MCU. Using an MCU also makesadding functionality to the debug board easier because theMCU can be reprogrammed over USB. To simplify program-ming, the MCU used on the TDB is an MSP430 processorthat is identical to the one used on the Telos mote. It oper-ates at 8MHz.

In the current prototype, the TDB logs are streamed overUSB. It would also be possible to add some Flash mem-ory to act as a circular buffer as was done in FlashBox [9].Our maximum reliable streaming throughput over USB is1Mbps. This is limited by the USB 1.0 hub and adapter,which have a theoretical throughput of 1.5Mbps. Movingto USB 2.0 would boost the USB throughput to 480Mbps,which would make the MCU the bottleneck in streaminglogs. However, we have found 1Mbps to be sufficient in allof our experiments as long as buffers are added in the MCUand FPGA to absorb short bursts of data to be logged.

2.4 Firmware ArchitectureThe firmware of the TDB consists of C code for the MCU

and Verilog code for the FPGA.Firmware on the MCU: The MCU is responsible for ini-tialization tasks. When the TDB is first connected to themote, the MCU sets the FPGA into a mode where the MCUcan directly control the JTAG lines connected to the mote.Through JTAG commands, the mote is put into a halt stateand the program memory of the mote is read. A simple dis-assembly of the program is performed, where the start ofevery function block is discovered by examining the destina-tion of every call instruction in the code. This approach willnot reveal functions that are called only by function pointer,however such functions are not common, and if they mustbe debugged then the symbol table generated from a com-piler can be loaded. The resulting table of function blocksis then programmed into the FPGA’s RAM for use by thefunction lookup module – a module that takes an address asinput, and outputs the function block containing that address.Once the table is loaded, any watchpoint trigger can be seton the mote. What triggers will be set will depend on thegoal of the tracing. We explain in Section 3 the complete listof triggers that can be supported. Finally, the MCU sets theFPGA to either the PC polling or the watchpoint mode andresumes execution on the mote. Thus, the MCU on the TDBfunctions as an orchestrator, but leaves the core functionalityfor the FPGA.

The main advantage of reading and disassembling themote’s program memory when the TDB is connected to themote is that the TDB need not be aware a priori of what ap-plication the mote is running. This process is independent ofthe compiler or operating system used by the mote. In ad-dition to finding the entry point of every function, the disas-sembly of the code is also used to record the start and returnof every interrupt service routine, the address of every func-tion call and return, and the addresses of special nop instruc-tions (e.g. MOV R4, R4), that are used as trigger markers inthe code.Firmware on the FPGA: The FPGA is responsible for con-trolling the mote’s OCDM through JTAG. The FPGA pollsthe OCDM to to detect the occurrence of any events of inter-est. Following each iteration of the polling loop, processingmay need to be performed to decide whether or not the polleddata should be logged. For example, with PC polling, a logentry should be generated when the polled PC value falls intothe address range of a new function. To prevent a slowdownin polling, processing is pipelined. Figure 6 shows the PCpolling pipeline. At the end of each PC poll, the PC ad-dress is passed to the function lookup module. The functionlookup module contains a table in RAM of the start addressof every function block. A binary search is performed onthe table to find the start address of the function block thatcorresponds to the polled PC address. The function table ca-pacity is 1024 function pointers, so the lookup completes inat most log2(1024) = 10 reads from RAM. Lookup is nota bottleneck, because the FPGA is internally clocked at 48MHz, which allows 76 clock cycles per PC poll. After thecorrect function pointer is discovered, it is passed to a filter-ing module. This module decides to log the function pointer

Event Condition # Triggers

Function call MDB-F==0x12B0 1Function return MDB-F==0x4130 1Interrupt MAB-R≥0xFFE0 1Interrupt return MDB-F==0x1300 1Peripheral read 0x0010≤MAB-R≤0x01FF 2User defined MDB-F==0x4404 1

Table 1. Types of triggers available for monitoring events.

only if it is different from the last logged function pointer.Finally, function pointers that are to be logged are passedto a FIFO output buffer maintained in the FPGA. The MCUon the TDB reads this buffer and logs the data either in lo-cal Flash memory, or if required, streams the log entries to ahost machine over USB. The buffer is necessary because theMCU performs other functions, such as energy monitoring,and may not be able to read a value to be logged in the timeit takes to perform a single PC poll. The buffer also absorbspeaks in the rate of new functions being invoked. We haveobserved a buffer size of 256 to be sufficient to absorb allpeaks in the programs that we have monitored.3 Using AVEKSHA for Tracing and Profiling

There are three modes that AVEKSHA can operate in whilemonitoring application execution, namely Breakpoint mode,Watchpoint mode, and PC Polling mode. Depending on themode of operation, AVEKSHA interacts with the OCDM onthe application processor in different ways. Therefore, thesemodes have different tradeoffs in terms of the level of intru-siveness to the application (breakpoints are the most intru-sive), the flexibility offered in terms of the kinds of eventsthat can be observed (watchpoints are the most flexible), andthe speed of event logging (PC polling is the fastest). Beforewe describe the three modes of operation, we discuss thekinds of triggers that AVEKSHA can set for observing eventsof interest on the application processor.3.1 Types of Triggers Available

The OCDM on the application processor allows us to set 8concurrent triggers for detecting events of interest. Althoughthis number may, upon first glance, seem insufficient to cre-ate a complete profile of an application, that is not the casebecause the MSP430 offers far more advanced triggers thanjust the program counter (PC) reaching a particular value.For example, a trigger can compare the Memory Data Bus(MDB) or the Memory Address Bus (MAB) to a set valueor range of values. Additionally, the trigger can be restrictedto be active only during an instruction fetch (F), a memoryread instruction (R), or a memory write instruction (W). Thisgives us great flexibility in using these 8 concurrent triggersto capture all our events of interest.

All of the triggers that we use in this paper are listed inTable 1. The notation used for specifying the condition thatthe value on the Memory Data Bus equals 0x12B0 on an in-struction fetch is given by: MDB-F==0x12B0. This particulartrigger will fire for every function call because 0x12B0 is themachine code for a function call instruction. Similarly, themachine code for the return instruction from a function callis ret=0x4130. Therefore, the trigger MDB-F==0x4130 will

trigger on all function call return events. A call to an inter-rupt can be detected with the trigger MAB-R≥0xFFE0. Theinterrupt vector table is located between address 0xFFE0 andthe end of the address space at 0xFFFF. Every time an inter-rupt is to be serviced, the processor reads the interrupt vectortable to determine the address of the interrupt service routinethat corresponds to the interrupt being serviced. Interruptshave their own return instruction (reti=0x1300) that can bemonitored with the trigger MDB-F==0x1300.

The compound trigger 0x0010≤MAB-R≤0x01FF will firefor every read to memory between addresses 0x0010 and0x01FF. A compound trigger, such as the above, that con-tains two conditions is made by joining two triggers together,and uses 2 of the 8 available trigger entries. In the MSP430,the peripherals are all memory mapped to addresses between0x0010 and 0x01FF. The peripherals include any sensorsthat may be attached to the application processor. For pur-poses of deterministic record and replay, it is important totrack what sensor values are read. This can be done by us-ing the trigger 0x0010≤MAB-R≤0x01FF, which captures aread from the memory-mapped peripheral portion of mem-ory. When a trigger is fired, the OCDM stores the values ofthe MAB and the MDB to the 8-entry circular buffer. Thestored MDB will contain the value that was read from theperipheral.

While functions and interrupts are interesting points formonitoring, we would like even more flexibility to monitorany arbitrary event in the executing application. For exam-ple, if we want to monitor the execution of every task inTinyOS, we cannot do this with a function call trigger. Thisis because the gcc compiler inlines many of the tasks in thescheduler’s runTask() function. One solution is to set thenoinline directive on all task functions. We have verifiedthat this works, however, this is unsatisfactory because it sac-rifices the efficiency gains obtained due to function call inlin-ing. A less costly solution is to trigger on a nop instruction.However, the MSP430 does not have an explicit nop instruc-tion. Instead, compilers emulate this instruction by using a 1cycle instruction that has no direct effect and no side effecton status or mode bits – specifically, gcc uses the instruc-tion MOV R3, R3 to emulate a nop. There are three possible1 cycle instructions that meet the requirements for no effector side effect: (MOV Rn, Rn), (BIC #0, Rn), and (BIS #0,Rn). With 16 registers available on the MSP430, this givesus 48 possible choices for an emulated nop. We can use dif-ferent application-specific meanings for each emulated nopinstruction to monitor 48 arbitrary events of interest. For ourpurposes we choose just one, (MOV R4, R4), which trans-lates to the machine code 0x4404, and add an instructionfetch trigger MDB-F==0x4404. A programmer can now placethe assembly code (MOV R4, R4) at arbitrary places in thecode to monitor user-defined events of interest, such as thebeginning of a task.3.2 Breakpoint Mode

Any of the 8 concurrent triggers available can be set asa breakpoint. When a breakpoint is reached, the applicationprocessor halts execution. AVEKSHA performs a continuouspoll of the CPU state of the application processor. Whenit sees that the CPU is halted, it retrieves the state of the

application processor (e.g., the value of the PC) and sendsthe JTAG command to resume execution.

Table 2 shows the speed at which a single poll (or test) ofthe CPU state can be performed, the time it takes to read thePC register, and the time it takes to resume CPU execution.All of these operations involve shifting values into the in-struction register (IR) and data register (DR) of the OCDM.For example, a test of the CPU state requires shifting oneIR and one DR, reading the PC register requires shifting 2IRs and 4 DRs, and resuming the CPU involves shifting 3IRs and 1 DR. The operations needed for achieving the tasksare not documented and we determined them through reverseengineering TI’s IAR debug interface [10]. An IR can beshifted in 15 cycles of the JTAG clock (TCK) and a DR canbe shifted in 23 JTAG clock cycles. Table 2 shows the timesfor performing the required shift operations in software withthe MCU on the TDB running at 8MHz, and the FPGA im-plementation. The FPGA is able to operate the JTAG clock(TCK) at 10MHz, which is the fastest we have been able tooperate the JTAG reliably for the breakpoint and the watch-point modes and is the maximum rated speed according tothe JTAG specification.

Breakpoints have the advantage that we never miss a trig-ger firing, because every time a trigger is reached the CPUis halted and control is passed to the TDB. The disadvantageof the breakpoint mode is that we lose the property of non-intrusiveness. In TinyOS, the MSP430 on the Telos is setby default to operate at 4MHz meaning 1µs = 4cycles. Us-ing the FPGA implementation, the time to poll and resumethe application processor is equivalent to 91.2 cycles of theapplication processor. The application processor has to behalted for at least this time while processing an event.3.3 Watchpoint Mode

Any of the 8 triggers can be set as watchpoints ratherthan breakpoints. The JTAG interface has an 8-entry circularbuffer where memory address bus (MAB) and memory databus (MDB) are stored when a watchpoint is hit. As indicatedearlier in Section 3.1, the trigger can be on an instructionfetch, read, or write. Thus, by recording the address buscontent on an instruction fetch, it is possible to know the PCvalue. The most recent entry written to the 8-entry buffer isindicated with a set flag. AVEKSHA polls the flag of the mostrecently written entry until it is cleared, indicating that a newentry has been written to the buffer. It continues to read en-tries until it again reaches an entry that has the last entry flagset.

Watchpoints have the benefit that unlike breakpoints, theyare not intrusive to the application. The application does nothave to interrupt its execution when a watchpoint trigger ismet. However, this also means that there is a threshold forthe rate of triggers that AVEKSHA can keep up with. Be-yond this rate, the circular buffer will wrap around and someevents of interest will be missed. Based on our empiricalmeasurements (given in Table 4.1), the entire processing forone invocation of of a watchpoint trigger takes 30.5µs, whichcorresponds to 122 cycles for the application processor at 4MHz. Thus, as long as we do not have a sustained burst of8 events of interest within 8×122 = 976 cycles, the TDB inthe watchpoint mode will not miss any event.

!"#$

!%&'$()*$!'+,$,-!'./,*0()*1$ +/,*0()*1$ '234560()*1$

7/%/,*0891$ 9/,*0891$

:9/,*0.91$ '2345+0.91$;8$

!"#$

!%&'$-<=+0891$,-!'./,*0891$ +/,*0.91$ '2345+0.91$

7/%/,*0)>).881$

9/,*0891$

:9/,*0()*1$ '234560()*1$;?$

;8$

Figure 7. TinyOS clock module setup (top) and modifiedsetup for PC polling (below).

3.4 PC Polling ModeThe final approach to trace generation, is to forego us-

ing triggers entirely, and instead poll the program counter ofthe application processor. With this approach, each polledPC value is used to determine what section of the code theapplication processor is executing in. For example, the func-tion lookup described in section 2.4 finds the function corre-sponding to a PC value. The advantage of PC polling is that itis about 19 times faster than the watchpoint mode and hencecan keep pace with a higher frequency of events, such as ev-ery function transition. From Table 2, we see that it takesjust 1.6µs to complete a single PC poll, which correspondsto 6.4 clock cycles of the application processor. This meansthat it is possible to miss function transitions shorter than 7clock cycles. However, with most instructions taking morethan one cycle, such a short time between function transi-tions is unusual. A disadvantage of PC polling is that it doesnot allow for advanced triggers – it only allows reading thePC value. For example, we could not use PC polling aloneto watch for a memory read or write to a specific memorylocation.

Implementing PC polling presents a practical challenge,namely, synchronizing the TDB and the application mote.This is because the TDB is reading the PC values while themote is executing. To achieve this synchronization, we re-place the mote’s external crystal oscillator with a wire fromthe FPGA. Then, we modify how TinyOS configures the pro-cessor’s clock module, as shown in Figure 7, so that themain clock (MCLK) on the processor is wired directly to theFPGA’s clock. This achieves synchronization of the TDBand the application processor.

In addition to reconfiguring the application processor’smain clock (MCLK), we must also change the source ofTimerA which requires a 1MHz clock signal, and TimerBwhich requires a 32kHz clock. The FPGA’s 4MHz signalcannot be divided down to 32kHz, because LFXT1CLK andACLK each have a maximum clock divider of 8. Therefore,we use the internal digitally controlled oscillator (DCOCLK)to provide a 32kHz signal to TimerB and connect TimerA toACLK and remove the external crystal from the applicationprocessor board. A consequence of this is that DCOCLKcannot be turned off when the processor goes to sleep, be-cause TimerB is responsible for waking the processor up.Another solution would be to connect the FPGA clock to theXT2 clock input pin. This would avoid removing the crystaloscillator and would allow TimerB to use the oscillator as its

Mode Operation Software (µs) FPGA (µs)

Breakpoint Test 43 3.8Read Addr. 140 12.2Resume 77 6.8Total 260 22.8

Watchpoint Test 200 18.3Read Addr. 140 12.2Total 340 30.5

PC Polling Read PC 48 1.6Total 48 1.6

Table 2. Time taken, in software and using the FPGA, toperform various operations through JTAG in the break-point, watchpoint, and PC polling modes.

input, so the mote would not need DCOCLK in sleep mode.We chose the first approach for practical reasons, because theXT2 pin is physically less accessible than the crystal oscilla-tor.

4 Experiments4.1 Microbenchmarks

The objective of our microbenchmarking experiments isto evaluate the performance of the building blocks of AVEK-SHA. In particular, we evaluate (a) how many clock cycles ittakes for AVEKSHA to perform event monitoring for each ofthe three modes – breakpoint, watchpoint, and PC polling,(b) the accuracy of the energy monitoring by comparing itwith measurements obtained using a Fluke multimeter anda dedicated power monitor from Monsoon Inc., and (c) theenergy consumption of the TDB itself.Time Taken in Each Monitoring Mode: Ideally, we wouldlike to poll the PC or the watchpoint buffer at a rate sufficientto observe every instruction executed on the application pro-cessor. Unfortunately, the MSP430’s OCDM was designedto be operated with a maximum frequency of 10MHz for theJTAG clock. One exception we have discovered empiricallyis that PC polling can be reliably clocked at up to 24MHz.Table 2 presents the effect this has on the time taken to com-plete basic polling operations. The software column showshow long operations take if only the MCU on the TDB is be-ing used while the FPGA column represents the time opera-tions take in the current FPGA implementation. The FPGAimplementation is limited only by how fast the JTAG clockof the application processor can be reliably driven. The ta-ble presents results in µs. TinyOS operates the main clock at4MHz by default, so 1µs=4 clock cycles.

The breakpoint mode of monitoring comprises a test op-eration to determine if the application processor has halted(which is done in a loop), a read address phase to collect theinstruction address at which the halt occurred, and a resumephase to restart execution. Likewise, the watchpoint modehas a test phase and a read address phase. The test phasehere is more complex because it has to test if a new entry hasbeen created in the JTAG circular buffer. In total, a watch-point poll takes 30.5µs, or 122 cycles of the application pro-cessor. For all of the applications that we have experimentedwith (which are a superset of the ones for which we pro-

Resistance Current Relative Error(Ohms) Computed (µA) TDB (µA) (Unitless)

179.71 18362.92 18483.56 0.007218.55 15099.52 15193.74 0.006560.8 5884.45 5807.06 0.013991.4 3328.63 3314.46 0.004

4689.2 703.74 674.77 0.04132610 101.20 92.34 0.08855220 59.76 52.97 0.114

179360 18.40 16.09 0.125266750 12.37 16.28 0.316

Table 3. Accuracy of the current measurements providedby TDB for fixed resistive loads, compared to values com-puted based on measurements with a Fluke multimeter.

vide results here), the rate of events, tasks, and application-level functions is lower than the above rate. However, if weinclude system-level entities (functions, events, and tasks),then this rate is occasionally exceeded. Finally, PC pollingonly requires a read PC operation that can be performed inunder 7 cycles of the application processor.Accuracy of Power and Energy Monitoring: The objec-tive of this experiment is to see if AVEKSHA can faithfullymonitor the power draw in the static case (using a fixed resis-tive load) and when there are spikes in power consumption,which happen commonly in embedded systems, e.g., whenthe radio switches on. Table 3 shows the current consump-tion reported by the TDB for various resistive loads. Forcomparison, we measured the value of each resistor using ahigh-accuracy Fluke multimeter and computed the theoret-ical current consumption through it. As seen in the table,TDB’s current measurement is within 10% of the computedvalue for current draws of 100 µA or above, while the errorgoes up for smaller current values. The accuracy of the cur-rent measurement can be improved further using techniques(which we have not implemented yet) such as better decou-pling of analog and digital components, and use of a groundplane.

We also measured the power consumption reported bythe TDB while attached to a Telos mote running theTestNetworkLpl TinyOS application. It is important to notethat the amplitude of the power consumption trace in thiscase will have a significant dynamic range due to variouscomponents on the mote changing power states during ap-plication execution. The TDB measurement of energy drawover a 1 minute period is within 3.2% of that given by a Mon-soon power meter [11]. Since the spikes in energy draw arethree orders of magnitude higher than the steady state case,this close result can only be achieved because TDB monitorsthe current spikes faithfully. For comparison, the static en-ergy metering of iCount over a five decade range is accurateto 20% down to about 1 µA [12].Power Consumption of TDB: It is important that the TDBitself consume a small amount of power, because the typicalusage scenario is the TDB coupled to the application pro-cessor board when deployed in the field. A large source ofenergy saving on the application processor comes from en-

tering a low-power sleep state when not in use. With a smallmodification of the application processor’s code, to signalwhen it is in sleep mode, it is also possible for the TDB toenter a low-power sleep state.

A general purpose pin on the application processor is usedto signal to the TDB when to enter sleep mode. The sleepsignal is connected to the Flash Freeze pin on the FPGA.Flash Freeze is a feature of Actel’s IGLOO nano FPGAs,that allow them to enter a low power state while still hold-ing output pins at their previous state. Because the JTAGcontrolled OCDM is implemented as an externally clockedstate machine, as long as the JTAG I/O lines remain staticwhile in sleep mode, the OCDM will have the same stateupon wakeup. The FPGA is able to recover from sleep modewithin 1 µs.

We have implemented the necessary code to signal sleepmode in TinyOS. Only system code is affected, the applica-tion does not change. In total, we added only 10 lines of codeto TinyOS. First, we initialized the general purpose pin GIO0for output mode in MotePlatformC. Then we modified themacro TOSH SIGNAL, which is used to create interrupt han-dlers, such that the beginning of every interrupt sets GIO0 toindicate a transition to the awake state. Finally, we modifiedMcuSleepC to clear GIO0 just before transitioning to sleep.

The TDB consumes 55mW when active. The current im-plementation does not have the ability to disable the USBand energy monitoring components, which would be desir-able when the TDB is in sleep. However, we have mea-sured power consumption with these components removedat 18mW. Of this, 11mW are due to the 48MHz oscillator.This could be significantly reduced with a lower frequencyoscillator, multiplied to 48MHz by using the FPGA’s Phase-Locked Loop (PLL) module.4.2 Application Setup

Our experiments use both TinyOS and Contiki applica-tions without needing any extra programming effort sinceAVEKSHA is OS-agnostic by design. The only change tothe OSes is the one required due to the change of the clocksource that was required to synchronize the JTAG and appli-cation processor’s clocks for PC polling (as shown in Figure7). The clock initialization module is responsible for creat-ing the main processor clock from the hardware clock and forwiring other internal hardware clocks to different sources.This module has to be changed in the OS for AVEKSHA towork with our clock setup.

We use two TinyOS applications TestNetworkLpl andTestFtsp and an object tracking application in Contiki.TestNetworkLpl uses the collection tree protocol to pushsensor readings to a base station [13]. The wakeup periodis set to 128ms, which indicates how often the radio wakesup to test the channel for transmitting nodes. This is a typ-ical application for sensor networks. We present a bug thatwas uncovered when we used the TDB to monitor all tasksin the watchpoint mode. TestFtsp is a time synchronizationprotocol [14]. Finally, LightTracker is a Contiki object track-ing application takes light readings at each node and passesvalues that reach a threshold to the base station through amulti-hop routing protocol. The radio wakeup period is 125ms [15].

Figure 8. Watchpoint trace of states when sending a mes-sage in TestNetworkLpl, showing the application, low-power-listening, and radio layers. The number aboveeach state’s timeline corresponds to the numbering of thestates under the timeline. For example, in the low-powerlisten layer, state 1 is S OFF and 2 is S ON; at the beginningthe state is 1, then an extended period of state 2, followedby a return to state 1.

4.3 WatchpointsUsing States to Monitor Energy: One application for theTDB is to monitor various state variables. Monitoring statevariables and transitions is useful because they can be corre-lated to power consumption, and can aid in understanding thebehavior of applications (as argued in Quanto as well [4]).This is particularly true in TinyOS, where the event-drivenmodel encourages the use of explicit state machines.

In TestNetworkLpl, we have instrumented the applica-tion layer, low-power-listening layer, and the radio layer tomonitor state changes. The instrumentation is simply toplace a nop instruction which can be used as a trigger inthe watchpoint mode. In the application layer, the begin-ning of every task and event handler is instrumented. In thelow-power-listen layer, the state changes of interest are inthe RadioPowerState module. This uses the state compo-nent interface in TinyOS, which we instrumented with nopinstructions. In the radio layer, state variables have the post-fix m state. We wrote a script that finds all assignments tothese variables in the code and inserts a nop statement afterthe assignment.

Figure 8 shows a packet send that is initiated from theapplication layer when Timer.fired is triggered. Thefirst step is to turn on the radio by starting the voltageregulator and oscillator, which is given by the state variable

Figure 9. Watchpoint trace of task executions during aradio start event. The PowerCycleP startRadio task iscalled over 3000 times due to a bug in the handling of theCC2420CsmaP SplitControlState.

CC2420ControlP m state. The start up takes 1.6 ms,the duration of the VREG STARTING, VREG STARTED, andXOSC STARTING states. The CC2420TransmitP m stateshows the process of transmitting a message. Themessage transmission takes place during the statesS BEGIN TRANSMIT and S EFD. After the message istransmitted, the sender waits for an acknowledgment, whichis shown in the CC2420ReceiveP m state. This variableshows the acknowledgment being received at 12 ms (theS RX FCF state) after which it is read off from the radiolayer (the S RX PAYLOAD state). After this, the radio turnsoff at 32ms. A parameter of LPL controls the delay afterreceive and the default is set to 20ms which is verified byour experiment. This kind of low-level tracing of eventsin the stacks is useful for a developer wanting to get adetailed understanding of how a high-level function isaccomplished (in this case, transmission of a message whichrequires an acknowledgment). Such an understanding canbe used for performance tuning (speeding up some eventin the time line, or reducing the amount of time spent in aparticular state) or for energy optimization (knowing someenergy-expensive state, reduce the amount of time the nodespends in that state). This level of tracing would be verydifficult to obtain through purely software means because ofthe fine-level of instrumentation that will be required, andcorrespondingly the high level of perturbation that will becaused to the normal execution of the application. On theother hand, AVEKSHA does not have to make tightly coupledchanges to the hardware (the radio in this case), which aredifficult to make and in some cases impossible when thehardware or the firmware is closed source.Using Tasks to Debug an Application: The original objec-tive of this experiment was to trace the collection tree pro-tocol in the watchpoint mode. However, during the trac-ing, we observed some suspicious behavior that caused usto suspect that there was a bug in the low power listen-ing layer of TinyOS. This was discovered by instrumentingall of the tasks in TinyOS for the TestNetworkLPL appli-cation. The nesC compiler in TinyOS creates a function

Figure 10. Watchpoint trace of task executions with thestartRadio bug fixed.

6

0

1

2

3

4

5

Tim

e (m

s)

CtpRoutingEngineP__sendBeaconTask

DefaultLplP__startRadio

CC2420CsmaP__startDone_task

PowerCycleP__getCcaDefaultLplP__send

PowerCycleP__startRadio

Figure 11. Execution timeline that causes task spinning.

called SchedulerBasicP TaskBasic runTask that con-tains a switch statement with a case for every task. By in-serting a nop into each case, we can monitor every time anytask is executed. We originally found that AVEKSHA wasunable to keep pace with the rate of events that is generatedafter the mote is started up. Later, it turned out that this wasdue to a bug where some tasks were being repeatedly andunnecessarily re-posted.

Figure 9 shows a trace of the tasks shortly after the motestarts up. Three of the tasks (PowerCycleP startRadio,DefaultLplP resend, and CC2420CsmaP sendDonetask) are stuck in a spin for 128ms. This implies that these

tasks keep re-posting themselves and do not get any usefulwork done in each execution of the task. The spinningtasks are the result of the order of events that happen whenthe mote starts up. The timeline in Figure 11 shows therelevant events. First the collection tree protocol routingengine CptRoutingEngineP posts a task sendBeaconTaskto send a beacon. This task results in the LPL moduleDefaultLplP posting a startRadio task. The radio is dulystarted after 3.4ms and the radio layer CC2420CsmaP sendsa signal to DefaultLplP and PowerCycleP that the radio isstarted. PowerCycleP receives the signal first and schedulesthe PowerCycleP getCca task, after which DefaultLplPreceives the signal and schedules the DefaultLplP sendtask. Tasks are executed in the order they are scheduled inTinyOS. When PowerCycleP getCca executes it sched-ules PowerCycleP startRadio, which is executed afterPowerCycle send. The task PowerCycleP startRadio

fails, because the radio is already started. This is preciselywhere the bug lies. The radio has already been started andtherefore this task should not re-post itself, but should returnwithout doing anything. Eventually, when the sending of thebeacon message has completed, the radio is set off to sleepand the PowerCycleP startRadio task succeeds. This canbe seen from the CC2420 state of S STARTED toward the endof the timeline (when it is started a second time to performthe CCA).

The buggy version of PowerCycleP startRadio isshown first.static inline void PowerCycleP__startRadio__runTask(void) {if (PowerCycleP__SubControl__start() != SUCCESS) {

PowerCycleP__startRadio__postTask();}

}The undesirable effect of the bug is that fills a slot in the

task queue (though a redesign in TinyOS 2.x limits this ef-fect) and a task is being re-posted and invoked uselessly thususing up CPU resources.

The above is a real-case where the bug is acti-vated. We hypothesize the following plausible appli-cation case where the bug will be activated and thePowerCycleP startRadio task will never succeed andwill keep spinning endlessly. Consider an application thatstarts sending a message and shortly afterwards (after the ra-dio has finished S STARTING and entered S STARTED state)turns off low power listening. A low power listen intervalof 0 indicates that low power listening should be shut offand the radio left on in receive mode. In this case, the taskPowerCycleP startRadio will never have the SUCCESScondition and will continue to spin until the low power listeninterval is again changed. We have confirmed that this hap-pens when the following synthetic application is executed.event void RadioControl.startDone(error_t err) {sendMessage();// Schedule the timer to fire while// PowerCycleP__startRadio is spinningcall Time.startOneShot(100);

}event void Timer.fired() {call LowPowerListening.setLocalWakeupInterval(0);

}To fix this bug, consider what happens

to the state of the CC2420 radio (shown asCC2420CsmaP SplitControlState in Figure 9). Thefunction PowerCycleP SubControl start() tries tostart the radio and tests the state of CC2420. If the stateis STARTING it returns SUCCESS, if the state is STARTEDit returns EALREADY, and if the state is anything elseit returns EBUSY. Therefore, the simple fix to the taskPowerCycleP startRadio is as follows.static inline void PowerCycleP__startRadio__runTask(void) {if (PowerCycleP__SubControl__start() != SUCCESS&& PowerCycleP__SubControl__start() != EALREADY) {

PowerCycleP__startRadio__postTask();}

}Figure 10 shows that this fix indeed stops

PowerCycleP startRadio from spinning (except forthe short time the radio is in the transmitting state).Processes in Contiki: An advantage of our approach oversoftware tracing is that it is independent of the OS being

Figure 12. Watchpoint trace of application level func-tions and threads of a sender node in the Contiki trackingapplication.

local2Global floatsisfcallcallcallcall

callmulsf3

callcallcallcall

fixfsi

is_synced

pack_f

unpack_f

fpmul_parts

call

Figure 13. Call graph of the local2Global function inFTSP.

used. Without any modification to the Contiki OS, the TDBis able to generate a trace of an application. In Figure 12,we show a trace of a sender node of a simple object trackingapplication called LightTracker [15], implemented in Con-tiki [16] version 2.4. LightTracker tracks a moving lightsource in a sensor network. There are two types of nodespresent in the network: a base station and a set of sendernodes. A sender node periodically (every 2 seconds) collectslight intensity using its light sensor and forwards it to thebase station, possibly in a multi-hop manner, if the sensedvalue is above a threshold. The base station periodicallychecks the received samples and selects the node with themaximum light intensity. The selected node is considered tobe the current position of the light source.

Unlike TinyOS, Contiki features the use of threads asa key design component. This reduces the need of main-taining explicit state machines in the code. In the senderapplication, we place a nop at the starting point of ev-ery thread command. PROCESS BEGIN represents the cre-ation of a thread and is performed once at time 11 sec-onds. PROCESS THREAD is executed every time the threadis started. PROCESS WAIT EVENT UNTIL is a wait statementin the thread that blocks until 2 seconds have passed on thetimer. The power spikes correspond well with the 125ms ra-dio wakeup period.4.4 PC Polling

Sampling the PC counter is a quick and non-intrusive op-eration. It does not have the flexibility of setting watchpointtriggers for specific conditions; however, it has the advantageof being able to measure events with greater timing accuracythan watchpoint polling. This is both because it is faster totake a sample of the PC counter and because it does not haveto do buffer management.

A useful application of PC polling is for statistical profil-

Figure 14. Trace of the functions invoked in one execu-tion of the local2Global function. Filled squares repre-sent data collected from PC polling, and open rectanglesrepresent the inferred executions of each function.

ing of an application, say to determine what parts of the codeare most active. Raw PC polling data cannot directly give aprofile of the number of times a function is called or the totaltime the function takes to execute, inclusive of the times ofthe nested functions that it calls.

However, it is sometimes possible to combine knowledgeof the call graph with the raw PC polling data to discover howmany times a function is called and how long each instanceexecutes. As an example, we examine the call graph ofthe local2Global time conversion function, in the TinyOSimplementation of the synchronization protocol FSTP. Thisfunction requires a few calls to expensive 32-bit math func-tions. The order of calls in this function is fixed, and is shownin figure 13. In figure 14, the logged events generated fromPC polling are shown as filled squares. PC polling uses afunction lookup table, as described in Section 2. Every timea new function is entered an event is logged. Because the PCpolling rate is equivalent to 6.4 application processor cycles,we are likely to capture every transition between functions.Combining the events with the call graph we can determinewhen functions start and end as shown by the open rectan-gles.

For this experiment, events are timestamped with a preci-sion of 1 µs, which is less than the 1.6 µs a single PC polltakes. From the start of the execution of local2Global, tothe call of floatsisf, the TDB recorded 5 µs (equivalent to20 cycles of the application processor). A manual inspectionof the assembly code reveals that the correct cycle count is19, which is within the expected margin of error.

4.5 Overhead of a Simple Software ProfilerSoftware profiling is used to collect and arrange differ-

ent statistics about function calls in a program, such as thetime spent in each function, how many times a function wascalled, etc. A prominent example of a software profiler isgprof [17], which is widely used in Unix systems. In thisexperiment, we measure the overheads that a software pro-filer can introduce in an embedded wireless node. We createa simple software profiler following the principles outlinedin several papers [6, 1]. In it, profiling is performed by in-strumenting with additional code, the entry and exit points ofthe functions that we are interested in. The additional codecollects the time spent inside those functions.

For this experiment, we use the LightTracker object track-

static int read_light() {! uint16_t start_time = (uint16_t)clock_time();! SENSORS_ACTIVATE(light_sensor);! int val = light_sensor.value(LIGHT_SENSOR_PHOTOSYNTHETIC);! SENSORS_DEACTIVATE(light_sensor);! uint16_t end_time = (uint16_t)clock_time();! count_cycles[FID_READ_LIGHT] += end_time - start_time;! return val;!}!

1!2!3 !4!5!6!7!8!9!

Figure 15. A function with additional code (marked inred) for software profiling.

ing application introduced earlier and instrument the func-tion, read light(), that is used by a sender node to collectthe light intensity. Figure 15 shows the modified function.The newly added code for profiling is shown in red. Ba-sically, the added code takes a time-stamp (lines 2 and 6)at the entry and the exit points of the function and storestheir difference in a memory location (line 7). It should benoted that the time estimation is correct for a non-recursivefunction. For recursive functions, additional code is neededto keep track of the depth of the call in the recursion stack,which will incur additional overheads. Further, we only pro-file function entry and exit points and not other events of in-terest that AVEKSHA profiles, such as, point of an interrupt,data read from a peripheral, etc. From this perspective, ourestimation of the overhead of the software-based profiler isconservative and provides a baseline for any implementationof a software-based profiler.

We simulate and profile LightTracker, both for the orig-inal and the instrumented version of read light, usingthe Cooja [18] simulator for 500 seconds. In both cases,read light was called 249 times. The average run-timefor the regular version was 1,827 clock cycles per call. Theversion with code instrumentation for profiling spent 1,896clock cycles on average. So, the additional code intro-duced an average overhead of 69 clock cycles per call. Incomparison, AVEKSHA adds no overhead to the applicationprocessor, except in the case of nop triggered watchpoints,which execute in a single cycle. In our simulation, a totalof 6,52,109 calls for 158 different functions were reported.If we had instrumented each of those 158 functions, whichwould be the case for a complete profiler, the overhead wouldhave increased execution time by 182%. The instrumenta-tion of read light also increased the binary size by 201bytes. For resource limited sensor nodes, the additional over-heads, both in terms of run-time and binary size, are signifi-cant.

AVEKSHA avoids the overheads accociated with a soft-ware profiler because the application processor is relieved ofthe burden of performing profiling-related tasks, which arehandled by the TDB instead.

5 DiscussionTable 4 shows the cost of the components if the board

was created in quantities of 1000. For the purpose of creat-ing a flexible prototype, we used more powerful processorsthan are required for the current implementation. The MCUfirmware uses 8.7kB program ROM and 5kB RAM, whichare 18% and 49% of the MSP430’s total capacity respec-tively. The FPGA firmware uses 35% of the core logic and

Functionality Cost (USD)

MCU/FPGA 21.19Energy Monitoring 7.30USB Connectivity 6.99Misc 9.15Total 44.63

Table 4. Cost in quantities of 1000.

2.8kB or 62% of the RAM of the Actel FPGA.The target mode of operation is the TDB coupled with

the application processor board in field deployments. In thismode of operation, there is the consideration of where tostore the logged events. For debugging, it is sometime suf-ficient to have a history of the last few events before somecondition in the application was reached. For example, theOCDM of the MSP430 provides a history of the last 8 watch-point events with this type of debugging in mind. If this isthe case, the TDB’s main processor can maintain a circularbuffer of events in RAM. If a larger history is required itis possible to store events into the TDB MCU’s flash mem-ory. For very long term storage an external USB storage hostcould be attached to the TDB or a compressed trace can evenbe sent to a computing platform over wireless communica-tion.

An objective of AVEKSHA is to monitor the applicationprocessor without interfering with its execution. While thebreakpoint mode is not suitable from this standpoint, it doeshave potential application for security and reliability. For ex-ample, breakpoints can be used to implement memory pro-tection. The MSP430 has no memory protection and bufferoverflow code injection exploits are known [19]. The attackworks by injecting code into the stack and getting the pro-cessor to execute this code. Except for special cases, suchas some boot loaders, the application processor should neverneed to execute code from RAM. All code is in flash whichhas a well-defined address range. By setting a breakpointfor an instruction fetch outside of this range, we can preventsuch code injection exploits.

Our discussion of implementation details of AVEKSHAhas been specific to the MSP430 MCU. This is a shortcomingof the current implementation, though the concepts apply to abroad class of embedded processors. The requirement for anOCDM within the processor, which can be accessed throughthe JTAG interface is fundamental to our design. The specificpoints of dependence on the MSP430 are the exact format ofthe instructions used as event triggers, the address ranges ofthe peripherals whose reads we are interested in, and under-standing the state machine of the OCDM.6 Related Work

There are primarily three areas of work related to AVEK-SHA, namely power measurement, software and hardwarefor debugging sensor networks, and hardware support for de-bugging embedded systems.Power measurement: The problem of estimating or mea-suring power (or energy) consumption has been addressedextensively in the context of various electronic systems. Werestrict our discussion of prior work to techniques that specif-

ically target sensor networks. Various sensor network simu-lators, such as POWERTOSSIM, AVRORA, and COOJA pro-vide energy estimation capability based on pre-built powermodels of the target hardware platform. Measuring (as op-posed to estimating) the power consumed by a sensor node isusually done using the so-called sense resistor approach (seeSection 2). SPOT [3] is an energy meter for wireless sensornodes that is based on the sense-resistor approach and uses avoltage to frequency converter to transform the voltage sam-ples into an energy counter that can be read by the sensornode. iCount [12] is an energy meter design that targets sen-sor nodes that have a switching regulator. It provides energymetering capabilities at almost zero cost by just counting thecycles of the switching regulator. The Energy Endoscopeproject [5] uses a separate application-specific integratedcircuit (called EMAP2), implemented using a micro-powerfuse-based FPGA, to perform charge accumulation based onthe sense-resistor method. Similar to designs such as SPOT,AVEKSHA provides energy measurement capability with alarge dynamic range.

Quanto [4] builds on iCount by using regression mod-els to obtain per-component energy consumption based onthe aggregate measurement provided by iCount and also per-forms energy accounting to various application tasks throughcausal activity tracking. Activities are tracked by calls to asoftware-based logger. Currently, AVEKSHA does not havethe ability to assign energy consumption to individual activi-ties, however, AVEKSHA could replace the Quanto software-based logger by using watchpoints to indicate activities. Thiswould enable use of the Quanto algorithm to attribute energyconsumption.Sensor network debugging: Replay debugging is a wellknown technique for embedded systems [6] and has alsobeen proposed for sensor networks [1]. Envirolog presents asoftware-only solution for recording events to flash memory[7]. Applications are annotated to indicate what should berecorded, which a preprocessor then turns into C code. Dur-ing recording, 16 to 1024 bytes of RAM are used to bufferevents which are then stored to Flash. FlashBox adopts ahybrid hardware/software approach to eliminate the bottle-neck of writing to Flash [9]. In FlashBox, a second MCUand flash memory are added to provide dedicated record-ing. A recent software solution [2] has focused on a specifickind of tracing (control flow tracing) and combines intelli-gent static analysis with run-time trace compression to de-crease overhead. Nevertheless, this technique still requiresapplications to be instrumented to gather the tracing infor-mation. In contrast, AVEKSHA not only requires no modifi-cation to the application, but is also completely agnostic tothe OS used. The above techniques have the disadvantagethat they either perturb the timing behavior of the applica-tion, possibly suppressing some subtle bugs, or cause a largeslowdown in application execution.

Emulators such as AVRORA and COOJA provide the abil-ity to customize tracing and profiling of applications. InAvrora an interface is provided for writing custom plugins tomonitor the emulation. YETI extends COOJA with a GDBproxy [20]. This allows using the GDB debugging tool toset breakpoints and watchpoints in the code being emulated.

Similar extensibility could be added to AVEKSHA by provid-ing an API to customize three points: the watchpoint condi-tions set by JTAG, the event filtering algorithm performed bythe FPGA, and how the MCU stores the data.Hardware support for debugging embedded systems:Real-time trace functionality has been implemented in manyprocessor architectures. For example, the CoreSight TraceMacrocells provides hardware cores that can be added onas peripherals to an ARM-based system-on-chip to producea cycle-accurate trace of execution. This includes the abil-ity to collect and compress a large amount of trace data onchip and to transfer this data to a trace port interface unit,such as JTAG. The MSP430 MCU used in the Telos motehas a limited built-in OCDM. It is typically used by toolssuch as IAR [10] to record the last 8 instructions executedbefore a watchpoint or breakpoint. AVEKSHA goes beyondthis by demonstrating that it is possible to provide Core-Sight like functionality on a low end MCU commonly usedin wireless sensor nodes. Hardware designed to interface anOCDM to a host computer via the JTAG standard is oftenreferred to as an In-Circuit Emulator (ICE) or In-Circuit De-bugger (ICD), or more correctly, a JTAG adapter. Many ICEtools are available for the MSP430 processor family. An ex-ample, that inspired the authors to undertake this work, isthe open source GoodFET [21]. However, the GoodFET isonly capable of operations involving reading and writing toflash. The debugging features of the OCDM (e.g., break-points, watchpoints, state storage) are not documented byTexas Instruments, and the protocol to activate these featureswas reverse-engineered. There exist several point solutionsfor hardware meant for tracking different kinds of controlflow for the purpose of debugging, e.g., in [22], the authorsdesign a hardware ASIC that monitors loops taken by tasksin a multi-tasking environment and performs this in a non-intrusive manner to the application.

7 ConclusionIn this paper, we have presented AVEKSHA, a hardware-

software solution to the problem of tracing events at runtimein an embedded wireless node, without slowing down the ap-plication. AVEKSHA can trace a variety of events, such as,particular PC addresses, reads from peripherals, entry andexits from tasks and interrupt service routines, and arbitraryuser-defined events. We have shown through two applica-tions in TinyOS and one in Contiki that such tracing is use-ful for profiling the execution times of different tasks andevent handlers. While the watchpoint mode of operation iscapable of capturing a practically limitless variety of events,it cannot keep pace with events that occur more frequentlythan 122 clock cycles on a sustained basis. The PC pollingmode of operation is restricted in the kinds of events that itcan detect, but being faster it can keep pace with events thatoccur at a rate less than every 7 clock cycles. Our profiling oftasks uncovered a performance bug in the low power listenradio module of TinyOS, which we were able to fix. Further,AVEKSHA has the ability to do energy monitoring over theµA to mA range and coupling it to execution regions betweentwo events of interest.

This work points to the feasibility of tracing a wide vari-

ety of events of interest in a low-cost and non-intrusive man-ner while the embedded node is deployed in the field. We donot have to rely on expensive and custom-built hardware toachieve this. The events provided by AVEKSHA can be usedby a variety of existing and yet-to-be-developed solutions,such as replay-based debugging, performance profiling, andenergy monitoring.8 References[1] M. Wang, Z. Li, F. Li, X. Feng, S. Bagchi, and Y.-H. Lu, “Dependence-based

multi-level tracing and replay for wireless sensor networks debugging,” in Pro-ceedings of the 2011 SIGPLAN/SIGBED conference on Languages, compilersand tools for embedded systems, LCTES ’11, (New York, NY, USA), pp. 91–100, ACM, 2011.

[2] V. Sundaram, P. Eugster, and X. Zhang, “Efficient diagnostic tracing for wirelesssensor networks,” in Proceedings of the 8th ACM Conference on Embedded Net-worked Sensor Systems, SenSys ’10, (New York, NY, USA), pp. 169–182, ACM,2010.

[3] X. Jiang, P. Dutta, D. Culler, and I. Stoica, “Micro power meter for energy mon-itoring of wireless sensor networks at scale,” in Proceedings of the 6th interna-tional conference on Information processing in sensor networks, IPSN ’07, (NewYork, NY, USA), pp. 186–195, ACM, 2007.

[4] R. Fonseca, P. Dutta, P. Levis, and I. Stoica, “Quanto: tracking energy in net-worked embedded systems,” in Proceedings of the 8th USENIX conference onOperating systems design and implementation, OSDI’08, (Berkeley, CA, USA),pp. 323–338, USENIX Association, 2008.

[5] T. Stathopoulos, D. McIntire, and W. J. Kaiser, “The energy endoscope: Real-time detailed energy accounting for wireless sensor nodes,” in Proceedings ofthe 7th international conference on Information processing in sensor networks,IPSN ’08, (Washington, DC, USA), pp. 383–394, IEEE Computer Society, 2008.

[6] H. Thane, D. Sundmark, J. Huselius, and A. Pettersson, “Replay debugging ofreal-time systems using time machines,” in Parallel and Distributed ProcessingSymposium, IPDPS ’03, (Washington, DC, USA), pp. 288.2–, IEEE, 2003.

[7] L. Luo, T. He, G. Zhou, L. Gu, T. F. Abdelzaher, and J. A. Stankovic, “Achiev-ing repeatability of asynchronous events in wireless sensor networks with envi-rolog,” in INFOCOM 2006. 25th IEEE International Conference on ComputerCommunications. Proceedings, pp. 1 –14, april 2006.

[8] “Green hills software inc.” http://www.ghs.com/.[9] S. Choudhuri and T. Givargis, “Flashbox: a system for logging non-deterministic

events in deployed embedded systems,” in Proceedings of the 2009 ACM sympo-sium on Applied Computing, SAC ’09, (New York, NY, USA), pp. 1676–1682,ACM, 2009.

[10] “IAR Embedded Workbench for TI MSP430.” http://www.iar.com.[11] “Monsoon inc. power monitor.” http://www.msoon.com/LabEquipment/

PowerMonitor/.[12] P. Dutta, M. Feldmeier, J. Paradiso, and D. Culler, “Energy metering for free:

Augmenting switching regulators for real-time monitoring,” in Proceedings ofthe 7th international conference on Information processing in sensor networks,IPSN ’08, (Washington, DC, USA), pp. 283–294, IEEE Computer Society, 2008.

[13] O. Gnawali, R. Fonseca, K. Jamieson, D. Moss, and P. Levis, “Collection treeprotocol,” in Proceedings of the 7th ACM Conference on Embedded NetworkedSensor Systems, SenSys ’09, (New York, NY, USA), pp. 1–14, ACM, 2009.

[14] M. Maroti, B. Kusy, G. Simon, and A. Ledeczi, “The flooding time synchroniza-tion protocol,” in Proceedings of the 2nd international conference on Embeddednetworked sensor systems, SenSys ’04, (New York, NY, USA), pp. 39–49, ACM,2004.

[15] M. Hossain, A. Alim Al Islam, M. Kulkarni, and V. Raghunathan, “µsetl: Aset based programming abstraction for wireless sensor networks,” in InformationProcessing in Sensor Networks (IPSN), 2011 10th International Conference on,pp. 354 –365, april 2011.

[16] A. Dunkels, B. Gronvall, and T. Voigt, “Contiki - a lightweight and flexible op-erating system for tiny networked sensors,” in Local Computer Networks, 2004.29th Annual IEEE International Conference on, pp. 455 – 462, nov. 2004.

[17] S. L. Graham, P. B. Kessler, and M. K. McKusick, “gprof: a call graph executionprofiler,” in SIGPLAN Not., vol. 39, pp. 49–57, ACM, April 2004.

[18] F. Osterlind, A. Dunkels, J. Eriksson, N. Finne, and T. Voigt, “Cross-level sensornetwork simulation with cooja,” in Local Computer Networks, Proceedings 200631st IEEE Conference on, pp. 641 –648, nov. 2006.

[19] T. Goodspeed, “Msp430 buffer overflow exploit for wireless sensor nodes.”http://travisgoodspeed.blogspot.com/2007/08/machine-code-injection-for-wireless.html, August 2007.

[20] R. Huber, P. Sommer, and R. Wattenhofer, “Demo abstract: Debugging wirelesssensor network simulations with yeti and cooja,” in Information Processing inSensor Networks (IPSN), 2011 10th International Conference on, pp. 141 –142,april 2011.

[21] T. Goodspeed, “Goodfet.” http://goodfet.sourceforge.net, August 2011.[22] K. Shankar and R. Lysecky, “Control Focused Soft Error Detection for Embed-

ded Applications,” Embedded Systems Letters, IEEE, vol. 2, no. 4, 2010.

AVEKSHA: A Hardware-Software Approach for Non-intrusive Tracing

Documents

Transcript of AVEKSHA: A Hardware-Software Approach for Non-intrusive Tracing