
High-Performance Tensor Contraction without BLAS

Devin A. Matthews
Institute for Computational Engineering and Sciences (ICES), The University of Texas at Austin

Tensor computations – in particular tensor contraction (TC) – are important kernels in many scientific computing applications (SCAs). Due to the fundamental similarity of TC to matrix multiplication (MM) and to the availability of optimized implementations such as the BLAS, tensor operations have traditionally been implemented in terms of BLAS operations, incurring both a performance and a storage overhead. Instead, we implement TC using the much more flexible BLIS framework, which allows for reshaping of the tensor to be fused with internal partitioning and packing operations, requiring no explicit reshaping operations or additional workspace. This implementation, TBLIS, achieves performance approaching that of MM, and in some cases considerably higher than that of traditional TC. Our implementation also supports multithreading using an approach identical to that used for MM in BLIS, with similar performance characteristics. The complexity of managing tensor-to-matrix transformations is also handled automatically in our approach, greatly simplifying its use in SCAs.

References
• Goto approach: K. Goto and R. A. van de Geijn, ACM Trans. Math. Softw. 34(3), Article 12 (2008).
• BLIS: github.com/flame/blis; shpc.ices.utexas.edu/publications.html
• TBLIS: github.com/devinamatthews/tblis; D. Matthews, arXiv:1607.00291.
• GETT (independent “native” TC algorithm): P. Springer and P. Bientinesi, arXiv:1607.00145.

[Figure: the five loops of the Goto/BLIS matrix multiplication algorithm around the microkernel, showing the cache-blocking parameters (nC, kC, mC), the register-blocking parameters (mR, nR), the packing steps Ai → Ai~ and Bp → Bp~, and the level of the memory hierarchy (main memory, L3/L2/L1 cache, registers) targeted by each block.]

[Figure: the same five-loop structure as adapted by TBLIS. The tensor physical layout is treated as a matrix-like logical layout during partitioning; the packing steps Pack “Ai” → Ai~ and Pack “Bp” → Bp~ become tensor-to-matrix packing kernels producing the usual packed internal layout, and the output C is written back through a (possibly) scattered update.]

[Figure: a 6×3×2×3×4 tensor T_abcde viewed as an 18×24 matrix M_(cdb)(ae).]

• The “a” dimension is stride-1; the other dimensions have increasing strides (6, 18, 36, 108).
• The column “cdb” dimension has the stride of “c” (6×3 = 18) most of the time. Blocking lets us use this constant stride when possible, and a scattered algorithm otherwise.
• The row “ae” dimension is similarly stride-1 most of the time (i.e., M is “row-major”).
• rscat_M and cscat_M store the offset of each position in the rows or columns; rbs_M and cbs_M store the stride of each block, or zero for irregular blocks.
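For illustration, the block-scatter vectors for this example can be computed with a short routine like the one below. This is a self-contained sketch, not TBLIS's actual code: the function name block_scatter, its argument conventions, and the 4×4 block size are assumptions.

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <vector>

// Build the scatter and block-scatter vectors for one matrix dimension of a
// tensor viewed as a matrix.  The matrix dimension is a group of tensor
// dimensions (listed fastest-varying first) with the given lengths and
// strides.  scat[i] is the tensor offset of matrix row/column i; bs[b] is
// the constant stride of block b (of size block_size), or 0 if irregular.
void block_scatter(const std::vector<std::ptrdiff_t>& len,
                   const std::vector<std::ptrdiff_t>& stride,
                   std::size_t block_size,
                   std::vector<std::ptrdiff_t>& scat,
                   std::vector<std::ptrdiff_t>& bs)
{
    std::size_t n = 1;
    for (auto l : len) n *= (std::size_t)l;

    scat.assign(n, 0);
    std::vector<std::ptrdiff_t> idx(len.size(), 0);
    for (std::size_t i = 0; i < n; i++)
    {
        std::ptrdiff_t off = 0;
        for (std::size_t d = 0; d < len.size(); d++) off += idx[d]*stride[d];
        scat[i] = off;
        // Odometer-style increment of the grouped tensor indices.
        for (std::size_t d = 0; d < len.size() && ++idx[d] == len[d]; d++)
            idx[d] = 0;
    }

    std::size_t nblock = (n + block_size - 1)/block_size;
    bs.assign(nblock, 0);
    for (std::size_t b = 0; b < nblock; b++)
    {
        std::size_t lo = b*block_size, hi = std::min(n, lo + block_size);
        std::ptrdiff_t s = (hi - lo > 1 ? scat[lo + 1] - scat[lo] : 1);
        bool regular = true;
        for (std::size_t i = lo + 1; i < hi; i++)
            if (scat[i] - scat[i - 1] != s) { regular = false; break; }
        bs[b] = regular ? s : 0;
    }
}

int main()
{
    // T_abcde with lengths (6,3,2,3,4) and strides (1,6,18,36,108), viewed
    // as M_(cdb)(ae): rows group (c,d,b), columns group (a,e), 4x4 blocks.
    std::vector<std::ptrdiff_t> rscat, rbs, cscat, cbs;
    block_scatter({2, 3, 3}, {18, 36, 6}, 4, rscat, rbs);
    block_scatter({6, 4},    {1, 108},   4, cscat, cbs);

    std::printf("rscat:"); for (auto x : rscat) std::printf(" %td", x);
    std::printf("\nrbs:  "); for (auto x : rbs)   std::printf(" %td", x);
    std::printf("\ncscat:"); for (auto x : cscat) std::printf(" %td", x);
    std::printf("\ncbs:  "); for (auto x : cbs)   std::printf(" %td", x);
    std::printf("\n");
}
```

For the row group (c,d,b) this produces a constant block stride of 18 for the regular blocks and 0 for the irregular one, and unit stride (with two irregular blocks) for the column group (a,e), matching the description above.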

“TTDT”

The example contraction t_aeim += Σ_ck V_mkec · W_akic is transposed (reshaped) so that it becomes a single matrix multiplication A = B · C over grouped indices such as (ai), (ck), and (em), computed with a BLAS GEMM call; the result is then transposed back into the output tensor.

• Large memory overhead.
• Transpose does not parallelize well.
• Difficult to optimize out transposes.
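A sketch of TTDT for this contraction, under illustrative assumptions (dense row-major tensor layouts, the grouping t_(ai)(em) += W_(ai)(ck) · V_(ck)(em), and a CBLAS dgemm call); it is not taken from any particular TTDT implementation:

```cpp
#include <cblas.h>
#include <cstddef>
#include <vector>
using std::size_t;

// Sketch of TTDT for t_aeim += sum_ck V_mkec * W_akic using the grouping
//   T_(ai)(em) += W_(ai)(ck) * V_(ck)(em).
// The dimensions na, ne, ni, nm, nc, nk and the dense row-major layouts
// V[m][k][e][c], W[a][k][i][c], t[a][e][i][m] are illustrative assumptions.
void contract_ttdt(size_t na, size_t ne, size_t ni, size_t nm,
                   size_t nc, size_t nk,
                   const std::vector<double>& V,
                   const std::vector<double>& W,
                   std::vector<double>& t)
{
    // Transpose (reshape) the inputs into explicit matrices: this workspace
    // and memory traffic is exactly what TBLIS avoids.
    std::vector<double> Wm(na*ni*nc*nk), Vm(nc*nk*ne*nm), Tm(na*ni*ne*nm);

    for (size_t a = 0; a < na; a++)
    for (size_t i = 0; i < ni; i++)
    for (size_t c = 0; c < nc; c++)
    for (size_t k = 0; k < nk; k++)
        Wm[(a*ni + i)*(nc*nk) + (c*nk + k)] = W[((a*nk + k)*ni + i)*nc + c];

    for (size_t c = 0; c < nc; c++)
    for (size_t k = 0; k < nk; k++)
    for (size_t e = 0; e < ne; e++)
    for (size_t m = 0; m < nm; m++)
        Vm[(c*nk + k)*(ne*nm) + (e*nm + m)] = V[((m*nk + k)*ne + e)*nc + c];

    // One large GEMM: (ai)x(em) = (ai)x(ck) * (ck)x(em).
    cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                (int)(na*ni), (int)(ne*nm), (int)(nc*nk),
                1.0, Wm.data(), (int)(nc*nk),
                     Vm.data(), (int)(ne*nm),
                0.0, Tm.data(), (int)(ne*nm));

    // Transpose the result back, accumulating into t_aeim.
    for (size_t a = 0; a < na; a++)
    for (size_t e = 0; e < ne; e++)
    for (size_t i = 0; i < ni; i++)
    for (size_t m = 0; m < nm; m++)
        t[((a*ne + e)*ni + i)*nm + m] += Tm[(a*ni + i)*(ne*nm) + (e*nm + m)];
}
```

The explicit transposes require full copies of both inputs and of the result, which is the storage and bandwidth overhead referred to above.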

“LoG”

The same contraction is instead computed as explicit loops over matrix multiplications on tensor slices:

[t_am]_ei += [V_mk]_ec [W_ak]_ic   for all i, e, c

i.e. for i { for e { for c { GEMM } } }.

• Low data reuse.
• Poor cache behavior.
• Inner kernel is not always GEMM; it may be level 1 or 2 BLAS.
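For comparison, a LoG sketch under the assumption that the tensors happen to be stored so that the required slices are contiguous row-major matrices (t[e][i][a][m], V[e][c][m][k], W[i][c][a][k]); for less favorable layouts the inner kernel is no longer a clean GEMM and degrades to level 1 or 2 BLAS, as noted above.

```cpp
#include <cblas.h>
#include <cstddef>
#include <vector>
using std::size_t;

// Sketch of loop-over-GEMM (LoG) for t_aeim += sum_ck V_mkec * W_akic.
// Assumed (for illustration) layouts making the slices contiguous row-major
// matrices: t[e][i][a][m], V[e][c][m][k], W[i][c][a][k].
void contract_log(size_t na, size_t ne, size_t ni, size_t nm,
                  size_t nc, size_t nk,
                  const std::vector<double>& V,
                  const std::vector<double>& W,
                  std::vector<double>& t)
{
    for (size_t i = 0; i < ni; i++)
    for (size_t e = 0; e < ne; e++)
    for (size_t c = 0; c < nc; c++)
    {
        const double* Wic = &W[(i*nc + c)*na*nk];  // a x k slice of W
        const double* Vec = &V[(e*nc + c)*nm*nk];  // m x k slice of V
        double*       tei = &t[(e*ni + i)*na*nm];  // a x m slice of t

        // [t_am]_ei += [W_ak]_ic * [V_mk]_ec^T, accumulated over c.
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasTrans,
                    (int)na, (int)nm, (int)nk,
                    1.0, Wic, (int)nk, Vec, (int)nk,
                    1.0, tei, (int)nm);
    }
}
```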


“Native” (TBLIS)

The contraction t_aeim += Σ_ck V_mkec · W_akic is computed directly, using the tensor-aware matrix multiplication algorithm shown above.

• Performance nearly identical to GEMM.

Tensor Contraction

Tensor contraction:

C_ijk = Σ_mn A_kmni · B_njm

is essentially a generalization of matrix multiplication:

C_ij = Σ_k A_ki · B_kj

for n-dimensional arrays (tensors). The most common current methods for computing tensor contractions are:

• Transpose-transpose-DGEMM-transpose (TTDT), which reshapes (transposes) the tensors into matrices.

• Loop-over-GEMM (LoG), which uses explicit loops over a small GEMM operation.

A high-performance “native” algorithm is ideal.
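For concreteness, a naive nested-loop reference implementation of the example contraction above; the dimension names and the dense row-major layouts are illustrative assumptions, and no blocking or packing is attempted:

```cpp
#include <cstddef>
#include <vector>

// Naive reference implementation of C_ijk = sum_mn A_kmni * B_njm.
// The dimensions (ni, nj, nk, nm, nn) and the dense row-major layouts
// A[k][m][n][i], B[n][j][m], C[i][j][k] are illustrative assumptions.
void contract_naive(std::size_t ni, std::size_t nj, std::size_t nk,
                    std::size_t nm, std::size_t nn,
                    const std::vector<double>& A,
                    const std::vector<double>& B,
                    std::vector<double>& C)
{
    for (std::size_t i = 0; i < ni; i++)
    for (std::size_t j = 0; j < nj; j++)
    for (std::size_t k = 0; k < nk; k++)
    {
        double sum = 0.0;
        for (std::size_t m = 0; m < nm; m++)
        for (std::size_t n = 0; n < nn; n++)
            sum += A[((k*nm + m)*nn + n)*ni + i] * B[(n*nj + j)*nm + m];
        C[(i*nj + j)*nk + k] = sum;
    }
}
```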

Results and Conclusions

Random “square” tensor contractions with TTDT and TBLIS, compared to matrix multiplication. Even for “square” cases, where overhead is minimized, TTDT incurs as much as a 50% penalty, while TBLIS is consistently within 5-10%. TBLIS is also highly scalable. Since TBLIS uses the BLIS microkernels and blocking parameters, performance improvements in BLIS and ports to new architectures automatically carry over to TBLIS.

Speedup of TBLIS over TTDT for the GETT benchmark (arXiv:1607.00145), which spans memory-bound contractions (left) to compute-bound contractions (right). TTDT is efficient for compute-bound problems because transposition is only O(n^2), but the TBLIS speedup increases sharply for more memory-bound problems. The left-most contractions have non-unit-stride access during packing, which hinders TBLIS; this overhead can be overcome by using a 3D packing kernel (future work).

Tensors as Matrices: The Block-Scatter Matrix

BLIS partitions the input matrices into successively smaller blocks, eventually reaching a small, fixed-size microkernel such as 4x4 (pictured above), 8x4, or 6x8. These same “register blocksizes” are used during packing.

TBLIS uses this fact to encode tensor blocks as regular matrix blocks whenever possible, so that no overhead is incurred. Irregular blocks, which do not have a constant row or column stride, are handled specially by storing the offset of each row and column individually.

The tensor dimensions can also be reordered to ensure unit-stride access in at least two of the tensors (see the next panel for the effects of non-unit-stride access).
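A sketch of how a packing kernel can exploit this encoding; pack_panel and its arguments are illustrative assumptions, not TBLIS's actual interface. Regular blocks (nonzero block stride) are read with plain strided loads, exactly as for an ordinary matrix, while irregular blocks fall back to the per-row scatter offsets:

```cpp
#include <cstddef>
#include <vector>

// Sketch (not TBLIS's actual code) of packing one panel of a block-scatter
// matrix into the contiguous buffer consumed by the microkernel.  When the
// row block has a constant stride (rbs != 0) the loads are plain strided
// accesses, exactly as for an ordinary matrix; only irregular blocks fall
// back to the per-row scatter offsets.
void pack_panel(const double* T,                    // tensor data
                const std::vector<std::ptrdiff_t>& rscat,
                std::ptrdiff_t rbs,                 // stride of this row block, or 0
                const std::vector<std::ptrdiff_t>& cscat,
                std::size_t row0, std::size_t mr,   // first row, rows in panel
                std::size_t col0, std::size_t kc,   // first column, panel width
                double* packed)                     // mr*kc output buffer
{
    for (std::size_t p = 0; p < kc; p++)
    {
        std::ptrdiff_t coff = cscat[col0 + p];
        if (rbs != 0)
        {
            const double* src = T + rscat[row0] + coff;
            for (std::size_t r = 0; r < mr; r++)
                packed[p*mr + r] = src[(std::ptrdiff_t)r * rbs];  // regular block
        }
        else
        {
            for (std::size_t r = 0; r < mr; r++)
                packed[p*mr + r] = T[rscat[row0 + r] + coff];     // scattered block
        }
    }
}
```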

Leveraging the BLIS Framework

BLIS extends the high-performance Goto approach to matrix multiplication, as used in GotoBLAS and OpenBLAS, by adding two partitioning loops in plain C and requiring only a small microkernel to be highly optimized in assembly.

TBLIS uses this five-loop framework to perform tensor contraction by logically treating the tensor as a matrix during partitioning, requiring extra tensor-aware steps (yellow and green above) only during packing of the input tensors and updating of the output tensor. The high-performance microkernel and cache-blocking parameters are used unchanged, and the overall overhead is small.

Threading is simple to implement and can be added hierarchically to four of the five loops, leading to high scalability.
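The overall structure can be sketched as follows. This is a schematic of the five-loop Goto/BLIS algorithm with a scalar stand-in for the microkernel and a simplified packed layout (real BLIS packs into m_R x k_C and k_C x n_R micro-panels); the blocking parameters are illustrative. The comments mark the three places where TBLIS substitutes tensor-aware steps while keeping the loop structure and microkernel unchanged:

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>
using std::size_t;

// Schematic of the five-loop Goto/BLIS structure for C += A*B (column-major),
// with a scalar stand-in for the microkernel and a simplified packed layout.
// The blocking parameters are illustrative.  TBLIS keeps this structure and
// the real assembly microkernel; only the two packing steps and the update
// of C are replaced by tensor-aware (block-scatter) versions.
constexpr size_t NC = 512, KC = 256, MC = 128, NR = 4, MR = 4;

void gemm_five_loops(size_t m, size_t n, size_t k,
                     const double* A, size_t lda,   // m x k, column-major
                     const double* B, size_t ldb,   // k x n, column-major
                     double* C, size_t ldc)         // m x n, column-major
{
    std::vector<double> Bt(KC*NC), At(MC*KC);       // packed buffers

    for (size_t jc = 0; jc < n; jc += NC)           // 5th loop: N -> n_C
    {
        size_t nc = std::min(NC, n - jc);
        for (size_t pc = 0; pc < k; pc += KC)       // 4th loop: K -> k_C
        {
            size_t kc = std::min(KC, k - pc);
            // Pack Bp -> Bp~ (in TBLIS: a tensor-to-matrix packing kernel).
            for (size_t j = 0; j < nc; j++)
                for (size_t p = 0; p < kc; p++)
                    Bt[j*kc + p] = B[(jc + j)*ldb + (pc + p)];

            for (size_t ic = 0; ic < m; ic += MC)   // 3rd loop: M -> m_C
            {
                size_t mc = std::min(MC, m - ic);
                // Pack Ai -> Ai~ (in TBLIS: a tensor-to-matrix packing kernel).
                for (size_t i = 0; i < mc; i++)
                    for (size_t p = 0; p < kc; p++)
                        At[i*kc + p] = A[(pc + p)*lda + (ic + i)];

                for (size_t jr = 0; jr < nc; jr += NR)  // 2nd loop: n_C -> n_R
                for (size_t ir = 0; ir < mc; ir += MR)  // 1st loop: m_C -> m_R
                {
                    size_t nr = std::min(NR, nc - jr);
                    size_t mr = std::min(MR, mc - ir);
                    // Microkernel: m_R x n_R rank-k_C update of C
                    // (in TBLIS: written back via a possibly scattered update).
                    for (size_t j = 0; j < nr; j++)
                    for (size_t i = 0; i < mr; i++)
                    {
                        double sum = 0.0;
                        for (size_t p = 0; p < kc; p++)
                            sum += At[(ir + i)*kc + p] * Bt[(jr + j)*kc + p];
                        C[(jc + jr + j)*ldc + (ic + ir + i)] += sum;
                    }
                }
            }
        }
    }
}
```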

[Figure: matrix multiplication and tensor contraction performance (MKL, BLIS, TBLIS, TTDT) in GFLOPS/s (1 core) and GFLOPS/s per core (12 cores) vs. approximate M=N=K, on a Xeon E5-2690 v3.]

[Figure: speedup of TBLIS over TTDT for the GETT benchmark contractions on a Xeon E5-2690 v3, for 1 core and 12 cores.]

[Figure: block-scatter vectors for the example M_(cdb)(ae) above, with 4×4 blocks.
rscat_M = 0 18 36 54 72 90 6 24 42 60 78 96 12 30 48 66 84 102; rbs_M = 18 0 18 18 18.
cscat_M = 0 1 2 3 4 5 108 109 110 111 112 113 216 217 218 219 220 221 324 325 326 327 328 329; cbs_M = 1 0 1 1 0 1.]