
A 128 channel Extreme Learning Machine based Neural Decoder for Brain Machine Interfaces
Yi Chen, Student Member, IEEE, Enyi Yao, Student Member, IEEE, Arindam Basu, Member, IEEE

Abstract—Currently, state-of-the-art motor intention decoding algorithms in brain-machine interfaces are mostly implemented on a PC and consume a significant amount of power. A machine learning co-processor in 0.35-µm CMOS for motor intention decoding in brain-machine interfaces is presented in this paper. Using the Extreme Learning Machine algorithm and low-power analog processing, it achieves an energy efficiency of 3.45 pJ/MAC at a classification rate of 50 Hz. Learning in the second stage and the corresponding digitally stored coefficients are used to increase the robustness of the core analog processor. The chip is verified with neural data recorded in a monkey finger movement experiment, achieving a decoding accuracy of 99.3% for movement type. The same co-processor is also used to decode the time of movement from asynchronous neural spikes. With time-delayed feature dimension enhancement, the classification accuracy can be increased by 5% with a limited number of input channels. Further, a sparsity promoting training scheme enables a ≈2X reduction in the number of programmable weights.

Index Terms—Neural Decoding, Motor Intention, Brain-Machine Interfaces, VLSI, Extreme Learning Machine, Machine Learning, Neural Network, Portable, Implant

I. INTRODUCTION

Brain-machine interfaces (BMI) have become increasingly popular over the last decade and open up the possibility of neural prosthetic devices for patients with paralysis or in a locked-in state. As depicted in Fig. 1, a typical implanted BMI consists of a neural recording IC to amplify, digitize and transmit neural action potentials (AP) recorded by the micro-electrode array (MEA). Significant effort has been dedicated in recent years to developing energy-efficient neural recording channels for long-term operation of implanted devices [1] [2] [3] [4]. Some recent solutions have also integrated AP detection [5] [6] [7] [8] and spike sorting features [9] [10] [11]. However, in order to produce an actuation command (e.g. for a prosthetic arm), the subsequent step of motor intention decoding is required to map spike train patterns acquired in the neural recording to the motor intention of the subjects.

Though various elaborate models and methods of motor intention decoding have been developed in past decades with the goal of achieving high decoding performance [12] [13] [14], state-of-the-art neural signal decoding is mainly conducted on a PC, consuming a considerable amount of power and making it impractical for long-term use. With on-chip real-time motor intention decoding, the size and the power consumption of the computing device can be reduced effectively

Yi Chen, Enyi Yao, and Arindam Basu are with the Centre of Excellence in IC Design (VIRTUS), School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798 (e-mail: [email protected], [email protected]).


Fig. 1: Comparison of envisioned and traditional implanted BMI: The envisioned system uses a machine learning co-processor (MLCP) along with the DSP used in traditional neural implants to estimate motor intentions from neural recordings, thus providing data compression. Traditional systems perform such decoding outside the implant and use bulky computers.

and the solution becomes truly portable. Furthermore, integrating the neural decoding algorithm with the neural recording device is also desired to reduce the wireless data transmission rate and make the implanted BMI solution scalable as required in the future [15]. Until now, very few attempts have been made to provide a solution for this problem. A low-power motor intention decoding architecture using analog computing is proposed in [16], featuring active filtering with massively parallel computing through low-power analog filters and memories. However, no measurement results are published to support the silicon viability of the architecture. A more recent work proposes a universal computing architecture for neural signal decoding [17]. The architecture is implemented on an FPGA with a power consumption of 537 µW.

In this paper, we present a machine learning co-processor (MLCP) achieving low-power operation through massive parallelism, sub-threshold analog processing and careful choice of algorithm. Figure 1 contrasts our approach with traditional approaches: our MLCP acts in conjunction with the digital signal processor (DSP) already present in implants (for spike sorting, detection and packetizing) to provide the decoded outputs. The bulk of processing is done on the MLCP while simple digital functions are performed on the DSP. Compared to traditional designs that perform the decoding outside the implant, our envisioned system provides the opportunity for huge data compression by integrating the decoder in the implant. The MLCP is characterized by measurement and the decoding performance of the proposed design is verified with data acquired in an individuated finger movement experiment with monkeys. Some initial results of this work were presented in [18]. Here, we present more detailed theory, experimental results including decoding the time of movement, new sparsity


Fig. 2: Algorithm: (a) The architecture of the Extreme Learning Machine (ELM) with one nonlinear hidden layer and linear output layer. (b) Use of ELM in neural decoding for classifying movement type and onset time of movement.

promoting training and also discuss scalability of this architecture.

II. PROPOSED DESIGN: ALGORITHM

A. Extreme Learning Machine

1) Network Architecture: The machine learning algorithm used in this design is the Extreme Learning Machine (ELM) proposed in [19]. As depicted in Fig. 2 (a), the ELM is essentially a single hidden-layer feed-forward network (SLFN). The k-th output of the network (1 ≤ k ≤ C) can be expressed as follows,

o_k = \sum_{i=1}^{L} \beta_{ki} g(w_i, x, b_i) = \sum_{i=1}^{L} \beta_{ki} h_i = h^T \beta_k, \quad w_i, x \in \mathbb{R}^D;\ \beta_{ki}, b_i \in \mathbb{R};\ h, \beta_k \in \mathbb{R}^L    (1)

where x denotes the input feature vector, L is the number of hidden neurons, h is the output of the hidden layer, b_i is the bias for each hidden layer node, and w_i and β_{ki} are input and output weights respectively. A non-linear activation function g() is needed for non-linear classification. A special case of the nonlinear function is the additive node defined by h_i = g(w_i^T x + b_i). The above equation can be compactly written for all classes as o = hβ, o = [o_1, o_2, ..., o_C], where β = [β_1, β_2, ..., β_C] denotes the L×C matrix of output weights.

While the output o_k can be directly used in regression, for classification tasks the input is categorized as the k-th class if o_k is the largest output. Formally, we can define the classification output as an integer class label given by s = argmax_k o_k, 1 ≤ k ≤ C. Intuitively, we can think of the first layer as creating a set of random basis functions while the second layer chooses how to combine these functions to match a desired target. Of course, if we could choose the basis functions through training as well, we would need fewer such functions. But the penalty to be paid is longer training times. More details about the algorithm can be found in [19], [20].
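As a concrete illustration of Eq. (1) and the argmax classification rule, the following minimal NumPy sketch mimics the forward pass in software. The names (elm_forward, h_max) and the log-normal weight statistics are illustrative assumptions made here, not details taken from the chip or its firmware.

    import numpy as np

    def elm_forward(x, W, b, beta, h_max=2**14 - 1):
        """x: (D,) features; W: (L, D) fixed random input weights; b: (L,) biases;
        beta: (L, C) trained output weights. Returns outputs o and class label s."""
        h = np.clip(W @ x + b, 0.0, h_max)   # additive node with a saturating g()
        o = beta.T @ h                       # o_k = h^T beta_k, Eq. (1)
        return o, int(np.argmax(o))          # s = argmax_k o_k

    rng = np.random.default_rng(0)
    D, L, C = 30, 60, 12
    W = rng.lognormal(sigma=0.64, size=(L, D))   # log-normal, mimicking mirror mismatch
    beta = rng.standard_normal((L, C))           # placeholder; solved by least squares
    o, s = elm_forward(rng.random(D), W, np.zeros(L), beta)
    print(o.shape, s)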

2) Training Methods: The special property of the ELM is that w can be random numbers from any continuous probability distribution and remains unchanged after initiation [19], while only β needs to be trained and stored with high resolution. Therefore the training of this SLFN reduces to finding a least-square solution of β given the desired target values in a training set. We will next show two methods of training: the conventional one (T1) for improved generalization, as well as a second method (T2) that promotes sparsity. For simplicity, we show the solution of weights for one output o_k; the same method can be extended to the other output weights as well and can be represented in a compact matrix equation [19].

Suppose there are p training samples; then we can create a p×L hidden layer output matrix H where each row of H has the hidden layer neuron outputs for one training sample. Let T_k ∈ R^p be the vector of target values for the p samples. With these inputs, the two training methods are shown in Fig. 3. The step for L2 norm minimization can be solved directly with the solution given by β_k = H†T_k where H† is the Moore-Penrose generalized inverse of the matrix H. Hence, training can happen quickly in this case. The L1 norm minimization step in T2, however, has to be performed using standard optimization algorithms like LARS [21]. Thus T2 provides reduced hardware complexity due to a reduction in the number of hidden neurons at the cost of increased training time.

B. Neural Decoding

The neural decoding algorithm we use is inspired by the method in [22]. We replace the committee of ANNs in their work with an ELM in our case. Three specific advantages of the ELM for this application are (1) the fixed random input weights can be realized by a current mirror array exploiting fabrication mismatch of the CMOS process; (2) one-step training that is necessary for quick weight updates to address changes in input statistics and (3) the hidden layer outputs h can be reused for multiple operations on the same input data x. In this case, we have reused h to classify both the onset time and the type of movement. One disadvantage of the ELM algorithm is the usage of 1.5-3X hidden neurons compared to fully tuned architectures (e.g. SVM, AdaBoost), since the hidden nodes in ELM only create random projections that are not fine tuned [23]. However, implementing random weights results in more than 10X savings over fully tunable weights, making this architecture more lucrative overall. Next, we give an overview of the decoding algorithm while the reader is pointed to [22] for more details.

1) Movement type and Onset time Decoding: Figure 2 (b) depicts how the ELM is used in neural decoding. Even though the input is an asynchronous spike train, the ELM produces classification outputs at a fixed rate of once every T_s seconds. The input x is created from the firing rate of the spike trains p(t) = \sum_{t_s} \delta(t - t_s) of biological neurons by finding the moving average over a duration T_w. Hence, we can define the firing rate r_i of the i-th neuron at time instant t_k as r_i(t_k) = \int_{t_k - T_w}^{t_k} p(t) dt where T_s = 20 ms and T_w = 100 ms following [22]. Finally, x(t_k) = [r_1(t_k), r_2(t_k), ..., r_q(t_k)] where there are q biological neurons in the recording (D = q). As shown in Fig. 2 (b), we have C = M+1 output neurons in this case where there are M movement types to be decoded. The M+1-th neuron is used to decode the onset time of movement. For decoding the type of movement, we can directly


Fig. 3 (figure contents, reconstructed): T1 — S1: Create the hidden layer matrix H ∈ R^{p×L} for all p training samples; S2: Least square optimization with L2 regularization: \min_{\beta_k} \|H\beta_k - T_k\|_2^2 + \gamma\|\beta_k\|_2. T2 — S1: Create the hidden layer matrix H ∈ R^{p×L} for all p training samples; S2: Least square optimization with L1 regularization: \min_{\beta_k} \|H\beta_k - T_k\|_2^2 + \gamma\|\beta_k\|_1; S3: Prune the hidden layer neurons with zero output weights, giving H* ∈ R^{p×L'}; S4: Least square optimization with L2 regularization: \min_{\beta^*_k} \|H^*\beta^*_k - T_k\|_2^2 + \gamma\|\beta^*_k\|_2.

Fig. 3: Training methods for ELM: T1 is the conventionally used training method to improve generalization by minimizing the norm of the weights as well as the training error. T2 uses an additional step of sparsifying the output weights to reduce the required hardware.

use the method described earlier in Section II-A for the M-class classifier to get the predicted output class at time t_k as s(t_k) = argmax_p o_p(t_k), 1 ≤ p ≤ M.

For decoding movement onset time, we further create a binary classifier that reuses the same hidden layer but adds an extra output neuron. Similar to [22], this output is trained for regression; the target is a trapezoidal fuzzy membership function which gradually rises from 0 to 1, representing the gradual evolution of biological neural activity. This output o_{M+1} is thresholded to produce the final output G(t_k) at time t_k as:

G(t_k) = \begin{cases} 1, & \text{if } o_{M+1}(t_k) > \theta \\ 0, & \text{otherwise} \end{cases}    (2)

where θ is a threshold optimized as a hyperparameter. Moreover, to reduce spurious classifications and produce a continuous output, the primary output G(t_k) is processed to create G_track(t_k) that is high only if G is high for at least λ times over the last τ time points. Further, to reduce false positives, another detection is prohibited for T_r ms after a valid one. The final decoded output F(t_k) is obtained by a simple combination of the two classifiers as F(t_k) = G_track(t_k) × s(t_k).

2) Time delay based dimension increase (TDBDI): A common problem in long-term neural recording is the loss of information from electrodes over time due to tissue reactions such as gliosis, meningitis or mechanical failures [24]. Hence, initially functional electrodes may not provide information later on. To retain the quality of decoding, we propose a method commonly used in time series analysis: the use of information from earlier time points [25]. In the context of neural decoding, it means that we use more information from the dynamics of neural activity in functional electrodes in place of lost information from the instantaneous values of activity in previously functional electrodes. So if we use p−1 previous values from the n functional electrodes, the new feature vector is given by:

x(t_k) = [r_1(t_k), r_1(t_{k-1}), r_1(t_{k-2}), \ldots, r_1(t_{k-p+1}), r_2(t_k), r_2(t_{k-1}), \ldots, r_n(t_{k-p+1})]    (3)

where the input dimension of the ELM is given by D = n × p. This is a novel algorithmic feature in our work compared to [22].

III. PROPOSED DESIGN: HARDWARE IMPLEMENTATION

Fig. 1 shows a typical usage scenario for our MLCP where it works in tandem with the DSP and performs the intention decoding. The DSP only needs to send very simple control signals to the MLCP and performs the calculation of the second stage of the ELM (multiplication by the learned weights β). The input to the MLCP comes from spike sorting that can be performed on the DSP [10]. In some cases, spike sorting may not be needed and spike detection may be the only required pre-processing [24].

A. Architecture

Details and timing of the MLCP are shown in Fig. 4. We map the input and hidden layers of the ELM into the MLCP fabricated in AMS 0.35-µm CMOS process, where high computation efficiency is achieved by exploiting the fabrication mismatch abundantly found in analog devices, while the output layer that requires precision and tunability (tough to attain in analog designs) is implemented on the DSP. Since the number of computations in the first stage far outnumbers those in the second (as long as D >> C), such a system partition still retains the power efficiency of the analog design. Up to 128 input


Fig. 4: The diagram (a) and the timing (b) of the MLCP based neural decoder.

channels and 128 hidden layer nodes are supported by the MLCP, with each input channel embedding an input processing circuit that extracts the input feature from the incoming spike trains. As mentioned in the earlier section, we extract a moving average of the spike count as the input feature of interest.

On receiving a spike from the neural amplifier array (after spike detection and/or spike sorting), the DSP sends a pulse via SPK and a 7-bit channel address (A⟨6:0⟩) to the DEMUX in the MLCP for row-decoding. Each row of the MLCP has a 6-bit window counter (WinCNT) to count the total number of input spikes in a moving window with a length of 5t_s and a moving step of t_s. The length of t_s, normally set to 20 ms, is determined by the period of CLK_in. The counter value in the j-th row is converted into the input feature current I_DACj for the ELM, corresponding to the input x_j in Fig. 2. Furthermore, a 1-bit control signal (S_ext⟨1⟩) stored in each row determines whether the j-th row's input to the moving window circuit is an external spike count or a delayed spike count from the previous channel. The delay length can be selected from among 5 delay steps ranging from 20 ms to 100 ms, based on SDL⟨2:0⟩. This is how the TDBDI feature described earlier is implemented in the MLCP.

The input feature current from each row is further mirrored into all hidden-layer nodes by a current mirror array. Hence, the ratios of the current mirrors are essentially the input weights, and are inherently random due to fabrication mismatch of the transistors even when an identical value is used in the design. We use sub-threshold current mirrors to achieve very low power consumption, resulting in w_{ij} = e^{\Delta V_{t,ij}/U_T} with U_T denoting the thermal voltage and ΔV_{t,ij} denoting the threshold voltage mismatch between the input transistor on the j-th row and the mirror transistor on the i-th column of that row. This is similar to the concepts described in [26] [27]. The input weights are log-normal distributed since ΔV_{t,ij} follows a normal distribution. We therefore realize random input weights in a very low 'cost' way that requires only one transistor per weight. It is the fixed random input weights of the ELM that make this unique design possible. A capacitance C_M = 400 fF on each row sets the SNR of the mirroring to 43 dB.



Fig. 5: Sub-block circuit diagrams: (a) Input processing circuit to take the moving average of incoming spikes; (b) Current-mode DAC to convert the average value to the input x for the current mirror; (c) Neuron-based CCO to implement the hidden node non-linearity and convert to digital.

The hidden layer node is implemented by a current-controlled oscillator (CCO) driving a 14-bit counter with a 3-bit programmable stop value f_max to implement a saturating nonlinearity in the activation function g(). The advantage of choosing this nonlinearity is that it can be digitally set, and some neurons can also be configured to be linear to achieve good performance in linearly separable problems [28]. The computation of the hidden layer nodes is activated by setting NEU high. The output of the CCO is a pulse frequency modulated signal with the frequency proportional to the total input current. The counter outputs are latched and serially read using the CLK_OUT signal when NEU is low with the CCO disabled to save power. The output weights, β, are stored on the DSP where the final output o_k is calculated. Thus the MLCP performs the bulk of the MACs (D×L) while the DSP only


Fig. 6: Die photo and test board: The die photo of the MLCP fabricated in 0.35-µm CMOS process and the portable external unit (PEU) integrating the MLCP with an MCU and battery.

performs the C×L MACs of the output layer. It should be noted here that the output of the hidden layer neurons changes with the power supply voltage due to the sensitivity of the CCO frequency to power supply variation, leading to degradation of the decoding accuracy. However, since power supply variation is a common-mode component to all CCOs, normalization methods can be applied in post-processing (see Section IV-E4) to the hidden layer outputs to reduce the effect introduced by power supply variation.
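Behaviorally, each hidden node therefore acts as a linear current-to-count converter clipped at a programmable stop value. A sketch under assumed parameters: the counting window T_count is not specified here, and the linear frequency law anticipates Eq. (6) of Section III-C.

    def hidden_node_count(I_in, T_count=10e-3, C_f=100e-15, DVDD=0.6,
                          stop_value=2**14 - 1):
        f_cco = I_in / (C_f * DVDD)                   # ideal CCO frequency, see Eq. (6)
        return min(int(f_cco * T_count), stop_value)  # saturating nonlinearity g()

    print(hidden_node_count(1e-9))                    # 1 nA for 10 ms -> 166 counts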

B. Sub-circuit: Input processing

Fig. 5 shows diagrams of the circuit blocks in the MLCP. Fig. 5 (a) shows two adjacent input processing circuits with WinCNT1 configured to receive an external spike train by setting S_ext⟨1⟩ = 0 and WinCNT2 configured as a time delay based channel by setting S_ext⟨2⟩ = 1. The corresponding signal flows are also depicted in the figure by red dashed lines. The moving window counter is realized by (1) counting spikes in a sub-window of length t_s; (2) storing the sub-window counter value in a delay chain made of shift registers; and (3) adding and subtracting the previous 6-b output value with the corresponding sub-window counter values in the delay chain to get the new 6-b output value of WinCNT. This calculation can be represented as:

Q_n\langle 5:0 \rangle = Q_{n-1}\langle 5:0 \rangle + D_n\langle 3:0 \rangle - D_{n-5}\langle 3:0 \rangle,    (4)

where Q_n⟨5:0⟩ and D_n⟨3:0⟩ are the 6-b output value and the 4-b sub-window counter value at time instance n respectively. All registers in the input processing circuits toggle at the rising edge of CLK_in. The advantage of this structure is that the delay chain for the sub-window counter value is reused in the proposed TDBDI feature, leading to a compact design.
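Eq. (4) is easy to model in software; the following sketch (names ours) emulates the five-deep delay chain of 4-b sub-window counts feeding the 6-b moving-window total.

    from collections import deque

    def window_counter(sub_counts):
        """sub_counts: 4-b spike counts per t_s = 20 ms sub-window.
        Yields the 6-b moving-window value Q_n over the last five sub-windows."""
        chain, Q = deque([0] * 5, maxlen=5), 0
        for D_n in sub_counts:
            Q = Q + D_n - chain[0]            # Q_n = Q_{n-1} + D_n - D_{n-5}, Eq. (4)
            chain.append(D_n)                 # shift the delay chain
            yield Q

    print(list(window_counter([3, 2, 4, 1, 5, 0, 2])))   # -> [3, 5, 9, 10, 15, 12, 12]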

A compact, 6-bit MOS ladder based current-mode DAC, as shown in Fig. 5 (b), splits a reference current I_ref (6-bit programmable in the range of 1 nA to 63 nA) according to the WinCNT output value to generate the input feature current I_DAC to the current mirrors.

C. Sub-circuit: Current Controlled Oscillator

The diagram of the current controlled oscillator (CCO) is depicted in Fig. 5 (c). The capacitance of C_int = 400 fF sets the oscillation frequency of this relaxation oscillator based

Fig. 8 (panel statistics, reconstructed): (a) median counter value 174, µ = 174, σ = 0; (b) median counter value 712, µ = 712.1, σ = 0.48; (c) median counter value 2154, µ = 2153.6, σ = 1.39.

Fig. 8: Jitter performance: The variation in the counter output for a fixed value of input current is observed over 100 trials and plotted as a histogram for (a) low, (b) medium and (c) high input currents. The measured jitter is < 0.1%.


Fig. 7: MLCP circuit blocks measurement results: (a) TDBDI feature; (b) waveform of CCO oscillation; (c) transfer curves of all 128 CCOs.

on the summed input current, while C_f = 100 fF provides hysteresis through positive feedback. When NEU is pulled high, pFET M2 is turned off. M1 is used to set the leakage term b_i in equation 1 and can be set to 0 for most cases. I_in from the current mirrors starts to discharge v_mem until it crosses the threshold voltage of INV1, leading to a transition of all inverters. Then, v_mem is pulled down very quickly through a positive feedback loop formed by C_f. At the same time, M3 turns on, charging v_mem towards DVDD until it crosses the threshold voltage of INV1 from low to high, and the cycle repeats. Neglecting higher order effects, the time for each cycle


Fig. 9: DNL performance: DNL of 64 randomly selected input DAC channels shows ±3 LSB performance.

of the CCO operation is determined by the sum of the charging and discharging time constants of v_mem, and can be expressed as:

T_{CCO} = \frac{C_f \times DVDD}{I_{in}} + \frac{C_f \times DVDD}{I_{rst} - I_{in}},    (5)

where I_rst is the charging current when M3 is on. Normally I_rst >> I_in, reducing equation 5 to:

f_{CCO} = \frac{1}{T_{CCO}} \approx \frac{I_{in}}{C_f \times DVDD}.    (6)

IV. MEASUREMENT RESULTS

A. MLCP Characterization

This section presents the measurement results from the MLCP fabricated in 0.35-µm CMOS process. To test the circuit, we have integrated it with a microcontroller unit or MCU (TI MSP430F5529) to act as the DSP. Though we have not integrated it with an implant yet, this setup does allow us to realistically assess the performance of the MLCP with pre-recorded neural data as shown later. Moreover, the designed board is entirely portable with its own power supply and wireless TX-RX module (TI CC2500). Hence, it can be used as a portable external unit (PEU) for neural implant systems as well. As shown in Fig. 6, the MLCP has a die area of 4.95 × 4.95 mm² and the PEU measures 7.4 cm × 5.1 cm.


Fig. 10: The random input weights: (a) Measured mismatch map of the CCO frequencies; (b) Distribution of input weights and (c) ΔV_{t,ij}. These values are measured by reading the output counter values when a fixed input value is given one row at a time.

TABLE I: Mean and standard deviation of ΔV_{t,ij}

Chip No. | µ (mV) | σ (mV)
1 | 0.188 | 16.2
2 | 0.132 | 16.9
3 | -0.019 | 16.8
4 | -0.105 | 17.2
5 | 0.004 | 16.5
6 | 0.535 | 16.4
7 | 0.276 | 17.6
8 | -0.012 | 16.6

For the characterization results shown next, we use AVDD = 2.1 V powering the reference circuits to generate bias currents and DVDD = 0.6 V for the rest. Figure 7 (a) verifies the operation of the input processing by probing the output of the window counter, with the frequency of CLK_in and the input spike train being 20 Hz and 630 Hz respectively. The output, as labeled by Q⟨5:0⟩, increases from 0 to 63 within 100 ms in the left half of Fig. 7 (a). The TDBDI feature is shown in the right half of Fig. 7 based on the setting of SDL⟨2:0⟩ = 001 when S_ext = 1. It adds a delay of 40 ms to Q⟨5:0⟩, compared with the waveforms in the left half. The measured charging and discharging dynamics of the CCO based neuron are shown in Fig. 7 (b) by probing a buffered version of the membrane voltage v_mem. The measured transfer curves of the 128 CCOs in a chip are plotted in Fig. 7 (c), by varying the input spike frequency from 0 to 630 Hz. Here, the saturation of the count is not shown; when implemented, it stops the count at the preset value. The noise of the whole circuit is also characterized in terms of jitter at the output of the CCO. The variance in the counter value is measured for the same input current over 100 trials. This experiment is repeated for three different current values spanning most of the counting range. The results of this experiment, shown in Fig. 8, demonstrate a percentage jitter of less than 0.1% over the entire counting range.

Next, we show characterization results for the input DAC channels. Since it is not possible to separately measure the output current of the DAC, we measure the output of the CCO to infer the linearity of the DAC. This is reasonable since the linearity and the noise performance of the CCO are better than the 6-bit resolution of the DAC. Figure 9 plots the measured differential non-linearity (DNL) of 64 randomly selected input DACs. The worst case DNL is approximately ±3 LSB. While this DNL can be part of the non-linearity g(w_i, x, b_i) in the general case, it makes the implementation of the additive node less accurate.

Variation in the transfer curves of the CCO array is a result of random mismatch from various aspects of the circuits, mainly the current mirror array, which is expected and desired in this design. By applying the same input spike frequency of 320 Hz to each row individually, a mismatch map of the CCO frequencies is generated with I_ref = 32 nA, as presented in Fig. 10 (a), by reading out the quantized frequency values in the output counters. These frequencies are normalized to the median frequency and plotted in Fig. 10 (b) and (c) to show conformance to the log-normal distribution as expected. The underlying random variable ΔV_{t,ij} has a normal distribution with mean ≈ 0 and standard deviation ≈ 16.5 mV. In total, eight sample dies are characterized, with the mean and standard deviation of ΔV_{t,ij} for all chips listed in Tab. I.

B. Experiment

The neural data used to verify the decoding performance of the proposed design was acquired in a monkey finger movement experiment described in detail in [22]. In the experiment, the monkey puts its right hand into a pistol-grip manipulandum with one finger placed in one slot of the manipulandum. The monkey is trained to perform flexion or extension of the individual finger and wrist according to a given visual instruction. A single-unit recording device is implanted into the M1 cortex of the monkey, enabling real-time recording of single unit spike trains during the experiment. The entire data set includes neural data recorded from three different monkeys (Monkey C, G and K) performing 12 types of individuated movements, labeled by the moving finger number and by the first letter of the moving direction. Furthermore, all the trials are aligned such that the onset of the movement happens at 1 s. Therefore the ELM can be trained according to the given label and the onset moment.

C. Neural Decoding Performance

We have tested the MLCP based PEU using the data set mentioned above. A multiple-output ELM with the number of classes C = 12 is trained to identify the movement type of the trial. An additional output is used to decode the onset time of movement. During training, the pre-recorded input spikes from biological neurons in M1 are sent to the MLCP and the counter


Fig. 11: Measured movement type decoding performance: (a) Decoding accuracy versus number of hidden layer nodes; (b) Decoding accuracy versus number of M1 neurons (with/without TDBDI); (c) Decoding accuracy across monkeys; (d) Decoding accuracy across 8 dies.

values of H are wirelessly transmitted to a PC where f_max and β are calculated and communicated back. This process already includes non-idealities in the analog processor such as the DNL of the input DAC, non-linearity in the CCO and Early effect induced current addition errors; hence, the learning takes these effects into account and corrects for them appropriately. Then, the MLCP can run autonomously during the testing phase.

We present decoding results in a format similar to [22] for easy comparison wherever possible. For the first set of experiments, we use the normal training method T1 described in Section II-A2. As shown in Fig. 11 (a) with n = D = 30, the decoding accuracy of the 12 types of movements (the flexion and extension of the fingers and wrist in one hand) increases as L is increased, with a mean accuracy of 94.8% at L = 60. This trend is expected [19] since more random projections should allow better separation of the classes, up to the point when the amount of extra information from a new projection is negligible. Based on this result, we fix L = 60 for the rest of the experiments unless stated otherwise.

Next, we explore the variation in performance as the number of available neural channels (n) (or equivalently M1 neurons in this case) reduces while keeping L fixed at 60. Fig. 11 (b) shows that an increase in accuracy from 85.4% to 91.7% can be obtained at n = 15 by using delayed samples as added features (TDBDI). Here, we have used only one earlier sample; hence, p = 2 and the effective input dimension of the ELM is D = 2 × n. With n = 40, L = 60 and p = 2, a decoding accuracy of 99.3% can be achieved. Next, to check the robustness of the earlier result, the same experiment is performed using several different datasets, including individuated finger movement data from Monkeys K, C and G and combined


Fig. 12: Flow chart describing the finite state machine on the DSP to calculate G_track from G.

finger movement data from Monkey K (12 individuated movements and 6 types of simultaneous movements of two fingers). The results of the MLCP with increasing M1 neurons, as shown in Fig. 11 (c), are consistent with the software results in [22]. The trend of increasing performance with more M1 neurons is expected since they provide more information. The performance of the proposed MLCP is also robust across eight sample chips, as presented in Fig. 11 (d) for the same experiment as in the last


Fig. 13: Measured movement onset decoding results: (a) A segment of 40 channel input spike trains is shown with the real-time decoding output deciding when a movement onset happens and which type this onset is. (b) ROC curves of onset decoding.


Fig. 14: Advantage of sparsity promoting training T2: The sparsity promoting method chooses the best random projections and can reduce the required number of hidden neurons by around 50%.

two cases.

The hidden layer output matrix H is reused to decode the onset time of finger movement using the regression capacity of the ELM. As mentioned earlier, only one more output node is added to the ELM. The trapezoidal membership function described in Section II-B and shown in Fig. 13 (a) is set to 1 around the time of 1 s to indicate the onset and set to 0 where there is definitely no movement. Figure 12 illustrates the finite state machine in the MCU implementing the post-processing described in Section II-B to obtain G_track from the primary output G. Optimal values of λ = 6 and T_r = 140 ms can be found from the ROC curve shown in Fig. 13 (b). The nature

Fig. 15 (chart contents, reconstructed): Power breakup — current reference: 88.4 nW; 128 input DACs: 271.6 nW (both AVDD); CCOs & current mirrors: 54 nW (DVDD).

Fig. 15: Power breakup: Power dissipation in the MLCP is dominated by the fixed analog power consumption of 360 nW compared to the 54 nW dissipated from DVDD in the CCOs and counters.

of the ROC curves is again very similar to the ones in [22]. With H reused, we achieve real-time combined decoding by detecting when there is a movement in the trial and labeling the predicted movement type when a movement onset is detected. This is illustrated by a snapshot of the developed GUI in Fig. 13 (a), where three 2-s trials are shown with the 40-channel input spike trains recorded from the M1 region printed at the bottom part of the figure. The primary output, post-processed output and predicted movement type are also shown in the top half of the figure. Lastly, we show the benefits of the sparsity promoting training method T2 described in Section II-A2. To show the benefit of this method, we compare with the first experiment shown earlier in Fig. 11 (a), where n = D = 30 and the number of hidden layer neurons L is varied to see its effect on performance. It can be seen that for the method T2, the decoding accuracy increases to approximately the maximum value of 94.8% attained by the method T1 with far fewer hidden layer neurons (L ≈ 30). This is possible because the sparsity promoting step of minimizing the L1 norm of the output weights chooses the most relevant random projections in the hidden layer. Thus, the new method T2 can reduce power dissipation by approximately 50% due to the reduction in the number of hidden layer neurons.

D. Power Dissipation

Finally, we report the power consumption of the proposed MLCP for the 40 input channel, 60 hidden layer node, 12-class classification problem. The currents drawn from the analog and digital power supply pins were measured using a Keithley picoammeter. The power breakup is shown in Fig. 15. At the lowest values of AVDD = 1.2 V and DVDD = 0.6 V needed for robust operation, the total power dissipated is 414 nW, with 54 nW from DVDD and 360 nW from AVDD. Performing 40 × 60 MACs in the current mirror array at a 50 Hz rate of classification, the MLCP provides 3.45 pJ/MAC and 8.3 nJ/classify performance. It is clear that the efficiency is limited by the fixed analog power that is amortized across the L hidden layer neurons and D×L current mirror multipliers. The fundamental limit of this architecture is the power dissipation of the CCO and current mirror array, which is limited to 0.45 pJ/MAC.


TABLE II: Comparison Table

| | JSSC 2013 [29] | JSSC 2007 [30] | JSSC 2013 [31] | ISSCC 2014 [32] | This Work |
| Technology | 0.13 µm | 0.5 µm | 0.13 µm | 0.13 µm | 0.35 µm |
| Supply voltage | 0.85 V | 4 V | 1.2 V (digital), 1 V (analog) | 3 V | 0.6 V (digital), 1.2 V (analog) |
| Design style | Digital | Analog floating gate | Mixed mode | Analog floating gate | Mixed mode |
| Algorithm | SVM | SVM | Fuzzy logic | Deep learning feature | ELM feature with TDBDI |
| Application | EEG/ECG analysis | Speech recognition | Image processing | Autonomous sensing | Neural implant |
| Power dissipation | 136.5 µW | 0.84 µW | 57 mW | 11.4 µW | 0.4 µW |
| Max input dimension | 400¹ | 14 | 14 | 8 | 128¹ |
| Energy efficiency | 631 pJ/MAC² | 0.8 pJ/MAC | 1.4 pJ/MAC³ | 1 pJ/op⁴ | 5.2/1.46 pJ/MAC⁵ |
| Resolution | 16 b | 4.5 b | 6 b | 7.5 b | 7/14 b⁶ |
| Classification rate | 0.5-2 Hz | 40 Hz | 5 MHz | 8.3 kHz | 50 Hz |

¹ Can be further extended by reusing input channels at the expense of classification rate.
² Assuming 1000 support vectors.
³ 1024 6-bit multiplies at 10 MHz consume 14 mW.
⁴ The operations are much simpler than a MAC.
⁵ 5.2 pJ/MAC includes both analog and digital power for D = 40, L = 60 and C = 12. In reality, analog power is amortized across all multiplies and the peak efficiency of 1.46 pJ/MAC is attainable for D = L = 128 for the same value of C. See Section IV-D for details.
⁶ Each multiply is 7 bit accurate due to SNR limitation while the output quantization in the CCO-ADC has 14 bits for dynamic range.

In contrast, recently reported 16-bit digital multipliers consume 16-70 pJ/MAC [33] [34] [35] [36], where we ignore the power consumed by the adder for simplicity. We have also implemented near-threshold digital array multipliers in 65 nm CMOS operating at 0.65 V that resulted in an energy efficiency of 11 pJ/MAC, confirming the much lower energy attainable by analog solutions over digital ones. Moreover, implementing the MLCP computations in the digital domain would incur further energy costs due to memory access (for getting the weight values) and clocking, which are ignored here.

Since we implement the operation of the second stage in the digital domain, we need C × L multiplications per classification. For the case of L = 60 and C = 12 described above and an energy cost of 11 pJ/MAC for digital multiplies, the total energy cost of the second stage operation is 7.92 nJ/classify. Hence, the total energy/classification becomes 16.22 nJ and the combined energy/operation increases to 5.2 pJ/MAC. For peak energy efficiency, we consider D = 128, L = 128 and C = 12, resulting in a net energy/computation of 1.46 pJ/MAC including both stages.
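The bookkeeping above can be written out explicitly; all numbers below are taken from this section.

    D, L, C, f_class = 40, 60, 12, 50          # classification at 50 Hz
    P_mlcp = 414e-9                             # measured MLCP power (W)
    E_stage1 = P_mlcp / (D * L * f_class)       # -> 3.45 pJ/MAC (first stage)
    E_stage2 = C * L * 11e-12                   # digital second stage -> 7.92 nJ
    E_total = P_mlcp / f_class + E_stage2       # -> 16.22 nJ/classification
    E_per_mac = E_total / ((D + C) * L)         # -> ~5.2 pJ/MAC combined
    print(E_stage1, E_stage2, E_total, E_per_mac)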

E. Discussion

1) Comparison: Our MLCP is compared with other recently reported machine learning systems in Table II. Compared to the digital implementation of SVM in [29], our implementation achieves far less energy per MAC due to the analog implementation. [30], [31] and [32] achieve good energy efficiency similar to our method by using analog computing. [31] uses a multiplying DAC (MDAC) to perform the multiplication by weights; however, they have only 6-bit resolution in the multiply and the MDAC also occupies much larger area than the single transistor we use for multiplications. [30] and [32] use analog floating-gate transistors for the multiplication. Compared to these, our single transistor multiplier takes less area (no capacitors that are needed in floating-gates), does not require high voltages for programming charge and allows digital correction of errors because of the digital output.


Fig. 16: Array layout: The area of the current IC is limited by the pitch of the CCO and WinCNT circuits even though the actual area of the current mirrors (0.4 × 0.35 µm²) is very small.

2) Area Limits: Using a single transistor for multiplication in the first layer should provide area benefits over other schemes. The current layout (Fig. 16) was done due to its simple connection patterns and is not optimized. It can be seen that the actual area of a unit transistor in the array (0.4 × 0.35 µm²) is much less than the area of a unit cell in the layout, which is limited by the pitch of the CCO and the window counter circuits. Moving to a highly scaled process or folding the placement of the output CCO layer to be parallel to the input window counter circuits would enable a large reduction (≈ 80X) in the area of the current mirror array. The ultimate limit in terms of area for this architecture stems from the area of the capacitors; for this 128 input, 128 output architecture, the total capacitor area is 0.132 mm².

3) Data rate requirements: When used in an implant with offline training, the MLCP can reduce the transmission data rate drastically. Firstly, for direct transmission of 100 channel data sampled at 20 kHz with 10-bit resolution, the required data rate is 20 Mbps. This massive data rate can be reduced partially by including spike sorting [11]. In this case, assuming an 8-bit address encoding a maximum of 256 biological neurons each firing at a rate f_bio, the data rate to be transmitted for a conventional implant without a neural decoder is given by R_conv = 8 × 256 × f_bio. As an example, with f_bio = 100


Hz, R_conv = 204 kbps. This can be reduced even further by integrating the decoder as proposed here. For the proposed case, the output of the decoder is obtained at a rate f_deco. During regular operation after training, the data rate for C classes is given by R_prop,test = f_deco × ⌈log₂(C)⌉. As an example, for the case described in Section IV-C with f_deco = 50 Hz and C = 13, R_prop,test = 200 bps. This example shows the potential for thousand-fold data rate reductions over spike sorting by integrating the decoder in the implant.
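The three data rates quoted in this subsection follow from simple arithmetic.

    import math

    f_bio, f_deco, C = 100, 50, 13
    R_raw = 100 * 20e3 * 10                     # raw: 100 ch x 20 kHz x 10 b = 20 Mbps
    R_conv = 8 * 256 * f_bio                    # spike-sorted addresses ~ 204.8 kbps
    R_prop = f_deco * math.ceil(math.log2(C))   # decoded labels: 50 Hz x 4 b = 200 bps
    print(R_raw, R_conv, R_prop)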

From the viewpoint of power dissipation, the analog front end and spike detection can be accomplished within a power budget of 1 µW per channel [37] [38] [5]. Assuming a transmission energy of ≈ 50 pJ/bit from recently reported wireless transmitters for implants [39]–[41], the power dissipation for raw data rates of 200 kbps/channel and compressed data rates of 2 kbps/channel after spike sorting are 10 µW and 0.1 µW respectively. Hence, the power for wireless transmission is a bottleneck for systems transmitting raw data. For systems with spike sorting in the implant, this power dissipation is not a bottleneck. However, the power/channel needed for the spike sorter is about 5 µW. In comparison, if our decoder operates directly on the spike detector output, it can provide compression at a power budget of < 0.01 µW/channel. This would result in a total power dissipation/channel of ≈ 1 µW in our case compared to ≈ 6 µW in the case of spike sorting: a 6X reduction. There is a lot of evidence that decoding algorithms can work on the spike detector output [24]; in fact, it is believed that this will make the system more robust for long term use. This will be a subject of our future studies.

Even if the decoder is explanted, an MCU cannot provide sufficient throughput to support advanced decoding algorithms, while FPGA based systems consume a large baseline power. A custom MLCP based solution provides an attractive combination of low-power and high-throughput operation when paired with a flexible MCU for control flow.

4) Normalization for Increased Robustness: The variation of temperature is not a big concern in the case of implantable electronics since body temperature is well regulated. However, variation of the power supply voltage can be a concern. A normalization method can be applied to the hidden layer output to reduce its variation due to power supply fluctuation, at the cost of additional computation. The normalization proposed here can be expressed by:

h_{j,norm} = \frac{h_j}{\sum_{j=0}^{L} h_j / \sum_{i=0}^{D} x_i}.    (7)

The rationale behind the proposed normalization is that the effect of power supply fluctuation on the hidden layer output can be modelled as a multiplicative factor in the hidden layer output equation. As analyzed before, the output of the j-th hidden layer node can be formulated as h_j = \frac{I_{in,j}}{C_f \times VDD} t_{cnt}, where I_{in,j} is the input current of the j-th hidden layer node and t_{cnt} is the counting window length. Since I_{in,j} is proportional to the strength of the input vector x = [x_1, x_2, ..., x_D], we can model the relation between the input vector and the hidden layer output as h_j = K_j \alpha(T, VDD) \sum_{i=0}^{D} x_i, where the variation part is a multiplicative term α(T, VDD), and K_j lumps the constant part of the path gain from the input to the j-th hidden layer

Fig. 17: Normalization to reduce variation: Blue solid lines are the original hidden layer output from SPICE simulation, while green dashed lines are the normalized output in both (a) and (b). The input x in (a) and (b) is 8 and 10 respectively.

output. It is reasonable to assume that α(T, VDD) is the same across different nodes, since fluctuation of the power supply is a global effect on the chip scale. Hence, it can be cancelled by the proposed normalization as:

h_{j,norm} = \frac{h_j}{\sum_{j=0}^{L} h_j / \sum_{i=0}^{D} x_i} = \frac{K_j \alpha(T, VDD) \sum_{i=0}^{D} x_i}{\sum_{j=0}^{L} \left( K_j \alpha(T, VDD) \sum_{i=0}^{D} x_i \right) / \sum_{i=0}^{D} x_i} = \frac{K_j}{\sum_{j=0}^{L} K_j} \sum_{i=0}^{D} x_i.    (8)
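The cancellation in Eq. (8) can also be checked numerically. The sketch below applies an assumed supply-induced gain α to the model h_j = K_j α Σ_i x_i and shows that the normalized outputs are unchanged.

    import numpy as np

    def normalize_hidden(h, x):
        return h / (h.sum() / x.sum())          # Eq. (7)

    rng = np.random.default_rng(1)
    K, x = rng.uniform(0.5, 2.0, 60), rng.uniform(0, 63, 40)
    for alpha in (0.8, 1.0, 1.2):               # supply-induced multiplicative variation
        h = K * alpha * x.sum()                 # h_j = K_j * alpha * sum_i x_i
        print(normalize_hidden(h, x)[:3])       # identical for every alpha: it cancels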

Simulation results are presented here to verify the proposed method of normalization. The original hidden layer outputs (L = 3) are obtained by SPICE simulations where DVDD is swept from 0.6 V to 2.5 V and the input x (D = 1) changes from 8 to 10. The original and normalized values of one of the hidden layer outputs are compared in Fig. 17. As can be observed, the normalized output (green dashed lines) varies significantly less due to variation of DVDD than the original output (blue solid lines). The hardware cost for this normalization is D + L additions and L divisions. Assuming similar costs for division and multiplication, the normalization does not incur much overhead if C >> 1 since L × C multiplications are required by the second stage anyway.

5) Considerations for Long Term Implants: When using this MLCP based decoder in long term implants, we have to consider issues of parameter drift over several time scales. Over the long term of days, aging of the circuits in the MLCP or probe impedance changes due to gliosis and scarring may change performance. This is typically countered by retraining the decoder every day [24]. Such retraining has allowed decoders to operate at a similar level of performance over years. Over shorter time scales, any variation not sufficiently quenched by the normalization method described earlier can be explicitly calibrated by having digital multiplication of coefficients for every input and output channel. These can be determined periodically by injecting calibration inputs and observing the output of the CCO.

Another type of training, referred to as decoder retraining [42], [43], is needed to take into account changes in neural statistics during closed loop experiments. The training done here may be thought of as open loop training for initialization


of the coefficients of the second stage of the ELM. Next, the experiment has to be redone with closed loop feedback and a new training data set has to be generated for retraining the second layer weights. After several such iterations, the final set of weights of the second layer will be obtained.

V. CONCLUSION

We presented an MLCP in 0.35-µm CMOS with a die area of 4.95 × 4.95 mm² and a 7.4 cm × 5.1 cm PEU based on the proposed MLCP that achieves real-time motor intention decoding in an efficient way. Implementing the ELM algorithm, the MLCP utilizes massively parallel low power analog computing and hardware reuse, achieving a power consumption of 0.4 µW at a 50 Hz classification rate, resulting in an energy efficiency of 3.45 pJ/MAC. Learning in the second stage also compensates for non-idealities in the analog processor. Furthermore, it includes a time-delayed sample based dimension increase feature for enhancing decoding performance when the number of recorded neurons is limited. A sparsity promoting training method is shown to reduce the number of hidden layer neurons and output weights by ≈ 50%. We demonstrated the operation of the IC for decoding individuated finger movements using recordings of M1 neurons. However, the ELM algorithm used in the decoder is quite general and has been shown to be a universal approximator, equivalent to SVMs or multi-layer perceptrons [20]. Hence, our MLCP can also be used for other decoding applications requiring regression or classification computations. Higher dimensions of inputs and hidden layers can be handled by making a larger IC and also by reusing the same hidden layer several times. In either case, power dissipation increases but not energy/compute. Higher input dimensions can be accommodated at the same power by reducing the bias current input of the splitter DACs in the input channels [27]. An increase of hidden layer neurons, however, does incur a proportional power increase. Given that the power requirement of the current decoder is > 100X lower than the AFE, we can easily extend it to handle many more input and output channels.

VI. ACKNOWLEDGEMENT

The authors would like to thank Dr. Nitish Thakor for providing neural recording data.

REFERENCES

[1] R. R. Harrison, P. T. Watkins, R. J. Kier, R. O. Lovejoy, D. J. Black, B. Greger, and F. Solzbacher, “A low-power integrated circuit for a wireless 100-electrode neural recording system,” IEEE Journal of Solid-State Circuits, vol. 42, no. 1, pp. 123–133, Jan. 2007.

[2] R. Sarpeshkar, W. Wattanapanitch, S. K. Arfin, B. I. Rapoport, S. Mandal, M. W. Baker, M. S. Fee, S. Musallam, and R. A. Andersen, “Low-power circuits for brain-machine interfaces,” IEEE Transactions on Biomedical Circuits and Systems, vol. 2, no. 3, pp. 173–183, Sept. 2008.

[3] F. Shahrokhi, K. Abdelhalim, D. Serletis, P. L. Carlen, and R. Genov, “The 128-channel fully differential digital integrated neural recording and stimulation interface,” IEEE Transactions on Biomedical Circuits and Systems, vol. 4, no. 3, pp. 149–161, Jun. 2010.

[4] Y. Chen, A. Basu, L. Liu, X. Zou, R. Rajkumar, G. Dawe, and M. Je, “A Digitally Assisted, Signal Folding Neural Recording Amplifier,” IEEE Transactions on Biomedical Circuits and Systems, vol. 8, no. 8, pp. 528–542, Aug. 2014.

[5] E. Yao and A. Basu, “A 1 V, Compact, Current-Mode Neural Spike Detector with Detection Probability Estimator in 65 nm CMOS,” in IEEE ISCAS, May 2015.

[6] J. Holleman, A. Mishra, C. Diorio, and B. Otis, “A micro-power neural spike detector and feature extractor in 0.13-µm CMOS,” in IEEE Custom Integrated Circuits Conference (CICC), 2008.

[7] L. Hoang, Y. Zhi, and W. Liu, “VLSI architecture of NEO spike detection with noise shaping filter and feature extraction using informative samples,” in IEEE EMBC, Sept. 2009, pp. 978–981.

[8] B. Gosselin and M. Sawan, “An Ultra Low-Power CMOS Automatic Action Potential Detector,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 17, no. 4, pp. 346–353, Aug. 2009.

[9] T. Chen, K. Chen, Z. Yang, K. Cockerham, and W. Liu, “A biomedical multiprocessor SoC for closed-loop neuroprosthetic applications,” in IEEE International Solid-State Circuits Conference (ISSCC) Digest of Technical Papers, Feb. 2009, pp. 434–435, 435a.

[10] V. Karkare, S. Gibson, and D. Markovic, “A 130-µW, 64-Channel Neural Spike-Sorting DSP Chip,” IEEE Journal of Solid-State Circuits, vol. 46, no. 5, pp. 1214–1222, May 2011.

[11] V. Karkare, S. Gibson, and D. Markovic, “A 75-µW, 16-Channel Neural Spike-Sorting Processor With Unsupervised Clustering,” IEEE Journal of Solid-State Circuits, vol. 48, no. 9, pp. 2230–2238, Sept. 2013.

[12] S. Acharya, F. Tenore, V. Aggarwal, R. Etienne-Cummings, M. Schieber, and N. Thakor, “Decoding individuated finger movements using volume-constrained neuronal ensembles in the M1 hand area,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 16, pp. 15–23, 2008.

[13] P. Ifft, S. Shokur, Z. Li, M. Lebedev, and M. Nicolelis, “A brain-machine interface enables bimanual arm movements in monkeys,” Science Translational Medicine, vol. 5, pp. 1–13, 2013.

[14] L. Hochberg, D. Bacher, B. Jarosiewicz, N. Masse, J. Simeral, J. Vogel, S. Haddadin, J. Liu, S. Cash, P. van der Smagt, and J. Donoghue, “Reach and grasp by people with tetraplegia using a neurally controlled robotic arm,” Nature, vol. 485, pp. 372–375, 2012.

[15] I. H. Stevenson and K. P. Kording, “How advances in neural recording affect data analysis,” Nature Neuroscience, vol. 14, pp. 139–142, 2011.

[16] B. Rapoport, W. Wattanapanitch, H. Penagos, S. Musallam, R. Andersen, and R. Sarpeshkar, “A biomimetic adaptive algorithm and low-power architecture for implantable neural decoders,” in 31st Annual International Conference of the IEEE EMBS, 2009.

[17] B. Rapoport, L. Turicchia, W. Wattanapanitch, T. Davidson, and R. Sarpeshkar, “Efficient universal computing architectures for decoding neural activity,” PLoS ONE, vol. 7, p. e42492, 2012.

[18] Y. Chen, E. Yao, and A. Basu, “A 128 Channel 290 GMACs/W Machine Learning Based Co-Processor for Intention Decoding in Brain Machine Interfaces,” in IEEE ISCAS, May 2015.

[19] G. B. Huang, Q. Y. Zhu, and C. K. Siew, “Extreme Learning Machines: Theory and Applications,” Neurocomputing, vol. 70, pp. 489–501, 2006.

[20] G. Huang, H. Zhou, X. Ding, and R. Zhang, “Extreme learning machine for regression and multiclass classification,” IEEE Transactions on Systems, Man, and Cybernetics, Part B: Cybernetics, vol. 42, no. 2, pp. 513–529, Apr. 2012.

[21] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, “Least angle regression,” The Annals of Statistics, vol. 32, no. 2, pp. 407–499, 2004.

[22] V. Aggarwal, S. Acharya, F. Tenore, H. Shin, R. Etienne-Cummings, M. Schieber, and N. Thakor, “Asynchronous decoding of dexterous finger movements using M1 neurons,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 16, pp. 3–14, 2008.

[23] A. Rahimi and B. Recht, “Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning,” in Proceedings of Neural Information Processing Systems (NIPS), 2009.

[24] J. C. Kao, S. D. Stavisky, D. Sussillo, P. Nuyujukian, and K. V. Shenoy, “Information systems opportunities in brain-machine interface decoders,” Proceedings of the IEEE, vol. 102, no. 5, pp. 666–682, May 2014.

[25] A. Grigorievskiy, Y. Miche, A. Ventela, E. Severin, and A. Lendasse, “Long-term time series prediction using OP-ELM,” Neural Networks, vol. 51, pp. 50–56, Mar. 2014.

[26] A. Basu, S. Shuo, H. Zhou, M. Lim, and G. Huang, “Silicon Spiking Neurons for Hardware Implementation of Extreme Learning Machines,” Neurocomputing, vol. 102, pp. 125–134, 2013.

[27] E. Yao, S. Hussain, and A. Basu, “Computation using Mismatch: Neuromorphic Extreme Learning Machines,” in IEEE Biomedical Circuits and Systems Conference (BioCAS), Rotterdam, 2013, pp. 294–297.

[28] Y. Miche, A. Sorjamaa, P. Bas, O. Simula, C. Jutten, and A. Lendasse, “OP-ELM: Optimally pruned extreme learning machine,” IEEE Transactions on Neural Networks, vol. 21, no. 1, pp. 158–162, Jan. 2010.

[29] K. H. Lee and N. Verma, “A low-power processor with configurable embedded machine-learning accelerators for high-order and adaptive analysis of medical-sensor signals,” IEEE Journal of Solid-State Circuits, vol. 48, no. 7, pp. 1625–1637, July 2013.

[30] S. Chakrabartty and G. Cauwenberghs, “Sub-microwatt analog VLSI trainable pattern classifier,” IEEE Journal of Solid-State Circuits, vol. 42, no. 5, pp. 1169–1179, May 2007.

[31] J. Oh, G. Kim, B.-G. Nam, and H.-J. Yoo, “A 57 mW 12.5 µJ/epoch embedded mixed-mode neuro-fuzzy processor for mobile real-time object recognition,” IEEE Journal of Solid-State Circuits, vol. 48, no. 11, pp. 2894–2907, Nov. 2013.

[32] J. Lu, S. Young, I. Arel, and J. Holleman, “1 TOPS/W Analog Deep Machine-Learning Engine with Floating-Gate Storage in 0.13-µm CMOS,” in ISSCC Dig. Tech. Papers, 2014, pp. 504–505.

[33] Y. He and C.-H. Chang, “A New Redundant Binary Booth Encoding for Fast 2n-Bit Multiplier Design,” IEEE Transactions on Circuits and Systems-I, vol. 56, no. 6, pp. 1192–1201, June 2009.

[34] K.-S. Chong, B.-H. Gwee, and J. S. Chang, “A Micropower Low-Voltage Multiplier With Reduced Spurious Switching,” IEEE Transactions on VLSI, vol. 13, no. 2, pp. 255–265, 2005.

[35] M. La Guia de Solaz and R. Conway, “Razor Based Programmable Truncated Multiply and Accumulate, Energy-Reduction for Efficient Digital Signal Processing,” IEEE Transactions on VLSI, vol. 23, no. 1, pp. 189–193, Jan. 2015.

[36] N. Petra, D. De Caro, V. Garofalo, E. Napoli, and A. G. M. Strollo, “Truncated Binary Multipliers With Variable Correction and Minimum Mean Square Error,” IEEE Transactions on Circuits and Systems-I, vol. 57, no. 6, pp. 1312–1325, June 2010.

[37] D. Han, Y. Zheng, R. Rajkumar, G. Dawe, and M. Je, “A 0.45 V 100-channel neural-recording IC with sub-µW/channel consumption in 0.18-µm CMOS,” in IEEE International Solid-State Circuits Conference, 2013, pp. 290–291.

[38] E. Yao, Y. Chen, and A. Basu, “A 0.7 V, 40 nW Compact, Current-Mode Neural Spike Detector in 65 nm CMOS,” IEEE Transactions on Biomedical Circuits and Systems, Early Access, 2015.

[39] J. Tan, W. S. Liu, C. H. Heng, and Y. Lian, “A 2.4 GHz ULP reconfigurable asymmetric transceiver for single-chip wireless neural recording IC,” IEEE Transactions on Biomedical Circuits and Systems, vol. 8, no. 4, pp. 497–509, Aug. 2014.

[40] S. X. Diao, Y. J. Zheng, Y. Gao, S. J. Cheng, X. J. Yuan, and M. Y. Je, “A 50-Mb/s CMOS QPSK/O-QPSK transmitter employing injection locking for direct modulation,” IEEE Transactions on Microwave Theory and Techniques, vol. 60, no. 1, pp. 120–130, Jan. 2012.

[41] M. Chae, Z. Yang, M. Yuce, L. Hong, and W. Liu, “A 128-Channel 6 mW Wireless Neural Recording IC With Spike Feature Extraction and UWB Transmitter,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 17, no. 4, pp. 312–321, 2009.

[42] J. M. Fan, P. Nuyujukian, J. C. Kao, et al., “Intention estimation in brain machine interfaces,” Journal of Neural Engineering, vol. 11, no. 1, 2014.

[43] A. L. Orsborn, S. Dangi, H. G. Moorman, and J. M. Carmena, “Closed-loop decoder adaptation on intermediate time-scales facilitates rapid BMI performance improvements independent of decoder initialization conditions,” IEEE Transactions on Neural Systems and Rehabilitation Engineering, vol. 20, no. 4, pp. 468–477, July 2012.