Sparse Deep Neural Network Graph Challenge
- draft -

MIT Lincoln Laboratory Supercomputing Center
Jeremy Kepner, Simon Alford, Vijay Gadepally, Michael Jones, Lauren Milechin, Ryan Robinett, Sid Samsi
June 2019

Slide - 2

• Introduction

• Deep Neural Networks

• Sparse Data Sets

• Challenge

• Summary

Outline

Slide - 3

• GraphChallenge encourages community approaches to developing new solutions for analyzing graphs and sparse data

• Goal: to provide a well-defined community venue for stimulating research and highlighting innovations in graph analysis software, hardware, algorithms, and systems

• Audience: any individual or team that seeks to highlight their contributions to graph analysis software, hardware, algorithms, and/or systems

• Graph Challenge contributions should be submitted to the IEEE HPEC conference submission website (see ieee-hpec.org)

MIT/IEEE/Amazon GraphChallenge.org Goals

Slide - 4

• Static Graph Challenge: Sub-Graph Isomorphism
  – This challenge seeks to identify a given sub-graph in a larger graph
  – Triangle counting and K-Truss

Current Graph Challenges

• Streaming Graph Challenge: Stochastic Block Partition
  – This challenge seeks to identify optimal blocks (or clusters) in a larger graph

[Figure: sub-graph isomorphism example mapping graph H nodes a-j onto graph G nodes 1-8, e.g. f(a)=1, f(b)=6, f(c)=8, f(d)=3, f(g)=5, f(h)=2, f(i)=4, f(j)=7; streaming unstructured graph vs. block-optimized graph]

Slide - 5

GraphChallenge.org

Links to Graph Challenge benchmark specifications, data, and sample code

Slide - 6

Graph Datasets

• Public data available from a variety of sources: Amazon AWS Public Datasets, Wikipedia, Twitter, Stanford Network Analysis Project, Yahoo!, Graph500.org, autonomous systems, road networks
• Data formats are typically not standardized (we have normalized!)

Slide - 7

Graph Datasets Hosted on Amazon

• Data available for download in multiple formats
  – Adjacency and incidence matrix for each graph
  – File format: ASCII tab-separated file and MMIO format
    (MMIO – Matrix Market I/O format, available at http://math.nist.gov/MatrixMarket/)
• Small graphs with known partitions available for the block partitioning algorithm

Slide - 8

• Deep neural networks (DNNs) are at the heart of modern AI miracles
• Larger neural networks often perform better
  – Larger numbers of layers/features allow more non-linear boundaries
  – Problem: limited by expensive high-speed memory size
  – Solution: sparse (pruned) neural networks deliver comparable performance with less memory

New Challenge Goal: Enable Large Sparse Deep Neural Networks

[Figure: DNN diagram with input features y0, weights W0–W3, biases b0–b3, layer outputs y1–y4, and output classification. Spectrum from density to sparsity: Dense Training; Dense Training, then Pruning (LeCun et al., Hassibi et al.); Dense Training while Pruning (Srinivas et al., Han et al.); Sparse Training (Prabhu et al., X-Net)]

Slide - 9

• Introduction

• Deep Neural Networks

• Sparse Data Sets

• Challenge

• Summary

Outline

Slide - 10

Machine Learning

Before 2010

Slide - 11

Machine Learning

After 2015

Deep Neural Networks

Slide - 12

[Figure: scanned excerpts from the 1955 Western Joint Computer Conference proceedings, including Clark & Farley, "Generalization of Pattern Recognition in a Self-Organizing System," and Selfridge, "Pattern Recognition and Modern Computers," with panels labeled Neural Networks, Language, Strategy, and Vision]

Miracles of AI

Slide - 13

[Figure: DNN with input features y0, weights W0–W3, biases b0–b3, and layer outputs y1–y4 producing the output classification; successive layers extract edges, object parts, and objects]

yl+1 = h(yl Wl + bl)

Deep Neural Network

Slide - 14

• Y0 is (# inputs) x (# features); each row is a feature vector
  – # features = # neurons/layer (constant for all layers)
• Wl is (# neurons) x (# neurons); Wl(i,j) > 0 means a connection from neuron i to neuron j
• bl is a bias row vector applied to each input
• h(x) is the ReLU function with a max cutoff:
  for 0 ≤ x ≤ 32, x is unchanged; for x < 0, x is changed to 0; for x > 32, x is changed to 32

Graph Challenge Math Conventions

Yl+1 = h(YlWl + bl)

• Sparse DNN Challenge uses standard graph community terminology
  – Row vectors for features and left matrix multiply to progress through the network
• Standard AI definitions can be used (transpose all matrices and multiply on the right)
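
A minimal MATLAB sketch of these conventions; the sizes, density, and variable names here are illustrative only, not part of the challenge specification:

% Minimal sketch of the Graph Challenge conventions (sizes and values are illustrative).
nInputs  = 4;                           % number of input feature vectors (rows of Y0)
nNeurons = 8;                           % neurons per layer (constant for all layers)
YMAX     = 32;                          % ReLU max cutoff

h  = @(X) min(max(X, 0), YMAX);         % ReLU with a max cutoff
Y0 = rand(nInputs, nNeurons);           % input features; each row is a feature vector
W  = sprand(nNeurons, nNeurons, 0.25);  % sparse weights; W(i,j) > 0 means connection i -> j
b  = -0.1*ones(1, nNeurons);            % bias row vector applied to each input

Y1 = h(Y0*W + b);                       % one layer of Yl+1 = h(Yl*Wl + bl)
                                        % (implicit expansion adds b to every row; R2016b+)

The same h and left-multiplication convention are used by the reference code later in the deck.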

Slide - 15

• Larger neural networks often perform better
  – Larger numbers of layers/features allow more non-linear boundaries
  – Challenge: limited by GPU memory size
  – Solution: sparse neural networks

• Active research area in AI community

Next Frontier: Sparse Neural Networks

[Figure: first pages of three recent papers on sparse neural networks – "Efficient Sparse-Winograd Convolutional Neural Networks," Liu, Pool, Han & Dally, ICLR 2018 (arXiv:1802.06367); "SCNN: An Accelerator for Compressed-sparse Convolutional Neural Networks," Parashar et al., ISCA 2017 (arXiv:1708.04485); "Exploring the Regularity of Sparse Structure in Convolutional Neural Networks," Mao et al., NIPS 2017 (arXiv:1705.08922)]

GPU = Graphics Processing Unit

Slide - 16

• Introduction

• Deep Neural Networks

• Sparse Data Sets

• Challenge

• Summary

Outline

Slide - 17

• Large sparse DNNs are difficult to obtain from real data
  – Similar to early days of Graph500.org
  – First step is to simulate data with desired topological properties
  – Emphasis on the “hard” part of the problem: large sparse DNNs

• Approach: RadiX-Net1 sparse DNN generator
  – Efficiently generates a wide range of pre-determined DNNs
  – Specific number of connections per neuron
  – Equal number of paths between all inputs/outputs and intermediate layers

• DNN generation algorithm
  – Mixed radices generate base DNNs of specified connectedness
  – Kronecker product base DNNs to create larger/sparser DNNs
  – Randomly permute and chain DNNs to create very large DNNs

Sparse DNN Data

1RadiX-Net: Structured Sparse Matrices for Deep Neural Networks, Robinett & Kepner, IEEE IPDPS GrAPL Workshop 2, 2019

Slide - 18

Sparsity, path-connectedness, and symmetry using mixed-radix numeral systems

[Figure: example RadiX-Net topology from input layer to output layer]

Slide - 19

RadiX-Net topologies: diversity in layer sizes allowed by Kronecker product

Slide - 20

• Base DNNs (all with 32 connections per neuron) generated using RadiX-Net

• 6 layers, 1024 neurons/layer
  – Using radix = [[2,2,2,2,2,2]], kron = [16,16,16,16,16,16,16]

• 8 layers, 4096 neurons/layer
  – Using radix = [[2,2,2,2,2,2,2,2]], kron = [16,16,16,16,16,16,16,16,16]

• 10 layers, 16384 neurons/layer
  – Using radix = [[2,2,2,2,2,2,2,2,2,2]], kron = [16,16,16,16,16,16,16,16,16,16,16]

• 12 layers, 65536 neurons/layer
  – Using radix = [[2,2,2,2,2,2,2,2,2,2,2,2]], kron = [16,16,16,16,16,16,16,16,16,16,16,16,16]

• All non-zero weights set to 1/16 (weights into any neuron sum to 2; a sanity-check sketch follows below)
  – Observed to produce layer-count-independent properties (theory in progress)

Graph Challenge Base DNNs
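
As a quick property check, the following hedged sketch verifies those claims on one layer, assuming the layer has already been loaded as a sparse matrix W (see the .tsv reader sketch two slides later):

% Sanity check of one base-DNN layer W (sparse N x N), assuming W is already loaded.
assert(all(nonzeros(W) == 1/16));                 % every non-zero weight is 1/16
inDegree = full(sum(spones(W), 1));               % connections into each neuron (column sums)
assert(all(inDegree == 32));                      % 32 connections per neuron
assert(all(abs(full(sum(W, 1)) - 2) < 1e-12));    % weights into any neuron sum to 2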

Slide - 21

• A 6-layer, 64-neurons-per-layer, 2-connections-per-neuron RadiX-Net DNN produced from radix set [[2,2,2,2,2,2]]

• The Kronecker product of this DNN with [16,16,16,16,16,16,16] produces a 6-layer, 1024-neurons-per-layer DNN with 32 connections per neuron (see the sketch below)

Example RadiX-Net Graph
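
A minimal sketch of that Kronecker expansion for a single layer, assuming W_base is one 64x64 base layer with unit weights and that the expansion block is an all-ones 16x16 matrix scaled by 1/16 (an assumption consistent with the stated 32 connections per neuron and incoming weight sum of 2, not necessarily the exact generator code):

% Kronecker expansion of one base layer (illustrative; W_base assumed already built).
% W_base: sparse 64 x 64 layer, 2 connections per neuron, weights = 1.
W_big = kron(W_base, ones(16)/16);   % 1024 x 1024 layer with weights = 1/16
nnz(W_big) / size(W_big, 2)          % = 32 connections per neuron
full(sum(W_big(:, 1)))               % = 2, the weight sum into the first neuron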

Slide - 22

• Base DNNs are randomly permuted and replicated to generate 12 DNNs
• Each layer is written out as a .tsv file: row, column, weight (see the reader sketch after the tables below)

Graph Challenge Full DNNs

Total Edges = 32 x layers x neurons/layer

                           Neurons/Layer
  Layers        1024          4096          16384           65536
     120     3,932,160    15,728,640     62,914,560     251,658,240
     480    15,728,640    62,914,560    251,658,240   1,006,632,960
    1920    62,914,560   251,658,240  1,006,632,960   4,026,531,840

• Bias values selected to produce a reasonable number of outputs for MNIST input

  Neurons/Layer    1024    4096    16384    65536
  Bias             -0.3    -0.35   -0.4     -0.45
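
A minimal sketch of reading one layer file into a sparse weight matrix; the file name and the assumption of 1-based row/column indices are illustrative, not the official naming:

% Read one DNN layer (row, column, weight) .tsv file into a sparse matrix.
% 'n1024-l1.tsv' and 1-based indexing are assumptions about the released files.
nNeurons = 1024;
T  = dlmread('n1024-l1.tsv', '\t');                      % columns: row, column, weight
W1 = sparse(T(:,1), T(:,2), T(:,3), nNeurons, nNeurons);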

Slide - 23

• MNIST corpus of 60,000 28x28 pixel images of handwritten numbers is a staple of AI

• Sparse DNN Graph Challenge uses interpolated sparse versions of this entire corpus as input
  – Each image is resized to 32x32 (1024 neurons), 64x64 (4096 neurons), 128x128 (16384 neurons), and 256x256 (65536 neurons)
  – Each sparse resized corpus is saved as a .tsv file: row, column, 1 (see the reader sketch below)

MNIST input data

[Figure: MNIST originals (28x28) and MNIST sparse (256x256)]

The MNIST Database, Cortes, LeCun & Burges, http://yann.lecun.com/exdb/mnist/
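
The sparse MNIST inputs can be read the same way into the Y0 feature matrix; the file name here is again an assumption:

% Read the sparse MNIST corpus (row = image, column = pixel/neuron, value = 1) into Y0.
nInputs  = 60000;
nNeurons = 1024;                                         % 32x32 resized images
T  = dlmread('sparse-images-1024.tsv', '\t');
Y0 = sparse(T(:,1), T(:,2), T(:,3), nInputs, nNeurons);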

Slide - 24

• Introduction

• Deep Neural Networks

• Sparse Data Sets

• Challenge

• Summary

Outline

Slide - 25

• Download from GraphChallenge.org: DNN weight matrices Wl, sparse MNIST input data Y0, and truth categories
• Load a DNN and its corresponding input
• Create and set the appropriately sized bias vectors bl from the table
• Evaluate the ReLU DNN equation for all layers (timed)
    Yl+1 = h(YlWl + bl)
• Identify the categories (rows) in the final matrix with entries > 0 (timed)
• Compare computed categories with truth categories to check correctness
• Compute the connection rate for the DNN: (# inputs) x (# connections) / time
• Report the time and rate for each DNN measured (a driver sketch follows below)

Challenge Instructions
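
A hedged end-to-end sketch of these steps, assuming the weight matrices are in a cell array W, the inputs in Y0 (loaded as in the earlier reader sketches), the truth categories in a vector trueCategories, and using the inferenceReLUvec reference code on the next slide; variable names and truth handling are illustrative:

% Illustrative challenge driver (variable names and truth handling are assumptions).
nLayers = numel(W);                       % e.g., 120, 480, or 1920
bias    = cell(1, nLayers);
for i = 1:nLayers
  bias{i} = -0.3*ones(1, size(W{i}, 2));  % -0.3 is the 1024-neuron value from the bias table
end

tic;
Y = inferenceReLUvec(W, bias, Y0);        % timed: evaluate Yl+1 = h(Yl*Wl + bl) for all layers
categories = find(sum(Y, 2) > 0);         % timed: rows of the final matrix with entries > 0
inferenceTime = toc;

assert(isequal(categories(:), trueCategories(:)));  % check correctness against truth
nEdges = sum(cellfun(@nnz, W));                     % total connections in the DNN
rate   = size(Y0, 1) * nEdges / inferenceTime;      % (# inputs) x (# connections) / time
fprintf('time = %g s, rate = %g\n', inferenceTime, rate);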

Slide - 26

Example ReLU Sparse DNN Inference Matlab Code

function Y = inferenceReLUvec(W, bias, Y0)
% Performs ReLU inference using input feature vector(s) Y0,
% DNN weights W, and bias vectors bias.

  YMAX = 32;                    % Set max value.
  Y = Y0;                       % Initialize feature vectors.

  for i = 1:length(W)           % Loop through each weight layer W{i}

    % Propagate through layer.
    % Note: the graph convention of A(i,j) meaning a connection from i *to* j
    % requires *left* multiplication of feature *row* vectors.
    Z = Y*W{i};
    b = bias{i};

    Y = Z + (double(logical(Z)) .* b);   % Apply bias to non-zero entries.
    Y(Y < 0) = 0;                        % Threshold negative values.
    Y(Y > YMAX) = YMAX;                  % Threshold maximum values.

  end
return
end

Slide - 27

Avoid
• Exploiting the repetitive structure of weight matrices, weight values, or bias values
• Exploiting layer independence of results
• Using optimizations that would not work on real-world data

Do
• Use an implementation that could work on real-world data
• Create compressed binary versions of inputs to accelerate reading the data
• Split inputs and run in data parallel mode to achieve higher performance (see the sketch below)
  – Requires replicating weight matrices on every processor (can require a lot of memory)
• Split up layers and run in pipeline parallel mode to achieve higher performance
  – Saves memory, but requires communicating results after each group of layers
• Use other reasonable optimizations that would work on real-world data

Recommendations
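
A minimal data-parallel sketch of the input-splitting recommendation, assuming MATLAB's Parallel Computing Toolbox (parfor) is available; any batching mechanism would serve the same purpose:

% Data-parallel sketch: split the input rows across workers, replicate W and bias on each.
nWorkers = 4;
bnd      = round(linspace(0, size(Y0, 1), nWorkers + 1));  % chunk boundaries
Ychunk   = cell(1, nWorkers);

tic;
parfor p = 1:nWorkers
  rows      = (bnd(p) + 1):bnd(p + 1);
  Ychunk{p} = inferenceReLUvec(W, bias, Y0(rows, :));      % each worker gets a slice of inputs
end
inferenceTime = toc;
Y = vertcat(Ychunk{:});                                    % reassemble the full result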

Slide - 28

Example Serial Performance

Neurons      Layers    Connections      Inference Time    Inference Rate
per Layer               (edges)          (seconds)         (inputs1 x edges/sec)
   1024        120        3,932,160            626            376x10^6
   1024        480       15,728,640           2440            386x10^6
   1024       1920       62,914,560           9760            386x10^6
   4096        120       15,728,640           2446            385x10^6
   4096        480       62,914,560          10229            369x10^6
   4096       1920      251,658,240          40245            375x10^6
  16384        120       62,914,560          10956            344x10^6
  16384        480      251,658,240          45268            333x10^6
  16384       1920    1,006,632,960         179401            336x10^6
  65536        120      251,658,240          45813            329x10^6
  65536        480    1,006,632,960         202393            298x10^6
  65536       1920    4,026,531,840              -                  -

1Number of MNIST inputs = 60,000
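
As a worked check of the rate column (an illustrative recomputation, using the rounded reported times): for the first row, 60,000 inputs x 3,932,160 edges / 626 seconds ≈ 3.77x10^8, matching the reported 376x10^6 input-edges per second within rounding.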

Slide - 29

• Serial reference implementation run in parallel by splitting feature vectors
• DNN replicated
• Run on MIT SuperCloud
• Intel KNL processors with 192 GB RAM

Example Data Parallel Performance

[Figure: inference rate, (# inputs) x (# connections) / time, versus number of processors (100 to 600); curves for N = 1024, 4096, 16384, 65536 neurons/layer and L = 120, 480, 1920 layers; y-axis from 0.2x10^11 to 2x10^11]

Slide - 30

• The MIT/IEEE/Amazon Graph Challenge has provided a valuable venue for innovators in graph and sparse network analysis systems

• The Sparse DNN Challenge seeks to stimulate new work in the area of sparse AI

• The proposed challenge provides a wide range of simulated sparse DNNs to evaluate

Summary

[Figure: Deep Neural Network (DNN) diagram with input features y0, weights W0–W3, biases b0–b3, layer outputs y1–y4, and output classification]

Slide - 31

• David Martinez• Jeremy Kepner• Albert Reuther• Vijay Gadepally• Michael Jones• Lauren Milechin• Bill Arcand• Bill Bergeron• David Bestor

• Chansup Byun• Matt Hubbell• Pete Michaleas• Julie Mullen• Andy Prout• Tony Rosa• Jason Williams• Ryan Whelan• Charles Yee

Acknowledgements

• Paul Monticciolo• Sanjeev Mohindra• Steven Smith• Edward Kao• Michael Hurley• William Song