Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow...

Parallelizing Audio Feature Extraction Using an Automatically-Partitioned

Streaming Dataflow Language

Eric BattenbergMark Murphy

CS 267, Spring 2008

Introduction

• World is going Multicore– Sequential performance has hit the “Three Walls”

• Power Wall + Memory Wall + ILP Wall = Brick Wall [Patterson]

• Software development is about to get really hard– For performance, must utilize parallel processors– Multiple, non-portable granularities:

• SIMD vs Multi-Core vs Multi-Socket vs Local-Stores vs GPU ...

The StreamIt Language

The StreamIt Compiler Strategy

• 1. Fuse Stateless filters – Eliminates communication and buffer copies

• 2. Data-Parallelize– Allow for concurrent execution of future work

• 3. Pipeline– Optimal partitioning of the work

Coarsen Granularity

Data Parallelize

Software Pipeline

Graphic from:Exploiting Coarse-Grained Task, Data, and Pipeline Parallelism in Stream Programs.Michael Gordon.ASPLOS 2006, San Jose, CA, October, 2006.

Audio Feature Extraction• Frequency-warped spectrum

– Human ear has better spectral resolution at lower frequencies

– To get necessary spectral resolution need FFT of length ~1024 (23ms)

– With frequency warping only need length 32(~1ms)

• Allows finer time resolution – Improves analysis of rhythmic

patterns and fast transients.

0 0.5 1 1.5 2 2.5

x 104

0

0.2

0.4

0.6

0.8

1

frequency [Hz}

mag

nitu

de r

espo

nse

16-point DFT

0 0.5 1 1.5 2 2.5

x 104

0

0.2

0.4

0.6

0.8

1

frequency [Hz}

mag

nitu

de r

espo

nse

16-point Warped Delay Line DFT

Audio Feature Extraction• Frequency-warped spectrum

– Human ear has better spectral resolution at lower frequencies

– To get necessary spectral resolution need FFT of length ~1024 (23ms)

– With frequency warping only need length 32(~1ms)

• Allows finer time resolution – Improves analysis of rhythmic

patterns and fast transients.

Stream Graph

• Produced by StreamIt compiler.

• Stream graph for a 4 point warped FFT.

• For an application could be up to 128 point.

• Fine-grained version of the algorithm.

Interesting Architectures: Intel Clovertown

• r41 in RAD cluster• Dual-Socket X Quad Core, 2.66 GHz 4-wide

pipelines

Interesting Architectures: AMD Opteron

• r28 in RAD cluster• Dual Socket X Dual Core, 2.2 GHz, 3-wide

pipelines

Interesting Architectures: UltraSparc T2 (Niagara2)

• r45 in RAD Cluster• 1 socket X 8-core X 8-threads, 1.4 GHz

– two-wide in-order pipelines, 1 FPU per core

StreamIt Scaling Performance

• StreamIt compiler thinks its doing a good job.• It isn't.

Portability Problems

• StreamIt code base in a primitive state– Unable to produce working code for Niagara– Frequently buggy even on x86 Multicores

• Bugs ranged from “simple” to “subtle”– Simple: Check return of fopen() for NULL– Subtle: Die nondeterministically with cryptic error

message• Lack of root access

– Made meeting compiler dependencies difficult

Dual-Core Performance• As a sanity check, ran on Core 2 Duo

T9300 (Penryn)– Eric’s Laptop, root access, (w00t)

• Coarse-grained implementation designed for easy parallelization– Shows speedup with –O1 flag but

is actually faster with –O0• Fine-grained version shows definite

slowdown.– But performs much faster than

coarse-grained.– Compilation failed at 64 and 128

(out of memory)

Dual-Core Performance

• Best runs of each implementation compared– Coarse: 2 Cores (-O0)– Fine: 1 Core (-O1)– Control: Hand-optimized Matlab implementation, 2 threads

• Single processor StreamIt performance very acceptable.– Considering the productivity

of the StreamIt language and the performance of the automatically-optimized

• Why can’t the multi-core code work this well yet on x86 architectures?

― No, seriously, I want it now. ―

Conclusions

• High-Level, Domain-Specific programming environments– Productivity win: we didn't have to write streaming

communication library• Compile-time transformations, lowering to C++

– Efficiency win: remove run-time overhead of High-Level language

• StreamIt optimizations incomplete– Agnostic of cache-coherence/communication

topologies

Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow...

Documents

Transcript of Parallelizing Audio Feature Extraction Using an Automatically-Partitioned Streaming Dataflow...