Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream...

37
Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Name That Tune: Stream Audio Fingerprinting Stream Audio Fingerprinting

Transcript of Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream...

Page 1: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Chris Burges, John Platt,Jon Goldstein, Erin Renshaw

Microsoft Research

Name That Tune:Name That Tune:Stream Audio FingerprintingStream Audio Fingerprinting

Page 2: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

OutlineOutline

• What is audio fingerprinting?

• Robust Audio Recognition Engine Demo

• Some ideas for applications

• Other companies doing audio fingerprinting

• How does RARE work?

• How well does RARE work?

Page 3: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

What is Audio Fingerprinting?What is Audio Fingerprinting?

• Prepare database: Given audio clips, extract a ‘fingerprint’ from each clip, store

• Clip is not changed, unlike watermarking

• Run: Compare ‘traces’, computed from an audio stream, against your database of stored fingerprints, to identify what’s playing. Confirm using second fingerprint.

Page 4: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Why Is This A Research Topic?Why Is This A Research Topic?- - Audio Feature ExtractionAudio Feature Extraction - -

• Map high dim space of audio samples to low dim feature space

• Must be robust to likely input distortions

• Must be informative (different audio segments map to distant points)

• Must be efficient (use small fraction of resources on typical PC)

Page 5: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

• Identifies clips based on 6 second “fingerprints”

• 6 seconds mapped to 64 floats, generated once every 0.186 seconds (or 0.372 sec)

• Each 64-float vector checked against large database of fingerprints (e.g., 240,000 )

• End-to-end system requires approx. 10% CPU on 746 MHz PC

• Confirmatory fingerprint can be used to add further robustness

The RARE Fingerprinting SystemThe RARE Fingerprinting System

Page 6: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

How Confirmatory Fingerprint WorksHow Confirmatory Fingerprint Works

NONO

NONO

YES

IN DATABASE?

FIND SINGLE CONFIRM FP

MATCH? WINNER!YES

Page 7: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

RARE DemoRARE Demo• 7 songs chosen at random, out of 239,209• Play clean song, then play distorted version

Critical Mass: Good Morning: Clip to distort

The Glands: Free Jane: ‘Old Radio’ filter

Eddie Cantor: Joe Is Here: 8% Time Compress

Ella Fitzgerald: Somebody Loves Me: Add Reverb

Perl Bailey: They Didn’t Believe Me: Repeat Echo

Next: Stop, Drop and Roll: 20dB Noise Gate

Joy Division: Digital: Add treble to distort

Page 8: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

OutlineOutline

• What is audio fingerprinting?

• Robust Audio Recognition Engine Demo

• Some possible applications

• Other companies doing audio fingerprinting

• How does RARE work?

• How well does RARE work?

Page 9: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Identifying Broadcast Music IIdentifying Broadcast Music I

• You’re listening to FM radio at home

• You’d like to know what just played…

• …your eHome media center can tell you

Page 10: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Making WMP SmarterMaking WMP Smarter

• You’ve made your own CDs by mixing and matching songs from your collection

• You’d like Windows Media Player to tell you what’s playing!

Page 11: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Identifying Broadcast Music IIIdentifying Broadcast Music II

• You’re listening to FM radio on the road

• You’d like to know what just played, by holding a cell phone or PDA up to the source

Page 12: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Database CleaningDatabase Cleaning

• You’re managing a large database of audio

• You’d like to automatically identify and remove duplicates

• You’d like to add metadata automatically to the rest

Page 13: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Royalty Assessment / DRM / Royalty Assessment / DRM / Commercial TrackingCommercial Tracking

• ASCAP wants to know what radio stations in a given area are playing

• Businesses want to know if the commercial they paid for aired, and when, and if it was ‘lexiconed’

• Content providers want to know when pirated content is played

Page 14: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

OutlineOutline

• What is audio fingerprinting?

• Robust Audio Recognition Engine Demo

• Some ideas for applications

• Other companies doing audio fingerprinting

• How does RARE work?

• How well does RARE work?

Page 15: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Other AF EffortsOther AF Efforts

• AUDIBLE MAGIC: Partnered with Loudeye, CMJ, SESAC.

• RELATABLE: Used by Napster. Uses 30 sec.

• MOODLOGIC: File based, p2p royalty tracking.

• PHILIPS: Cell-phone based music recognition service.

• SHAZAM: Cell-phone based, UK market

• MSN Music: File based

Page 16: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Why use RARE?Why use RARE?

• RARE is flexible. Features are NOT hard-wired – they can easily be re-learned to handle new kinds of noise

• RARE is very robust and very fast• This is Microsoft’s own technology: easily

molded to different applications, we own the code and IP (2 patents filed)

Page 17: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

OutlineOutline

• What is audio fingerprinting?

• Robust Audio Recognition Engine Demo

• Some ideas for applications

• Other companies doing audio fingerprinting

• How does RARE work?

• How well does RARE work?

Page 18: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Feature ExtractionFeature Extraction

Output 64 floats every half-frame

Downsample to 11.025 KHz mono

Compute Robust ProjectionsCollate Samples

(Repeat)

De-EqualizePerceptual Thresholding

Compute time-frequencymap, take magnitudes

Page 19: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

De-EqualizationDe-Equalization

Before After

De-equalize by flattening the log spectrum.

Page 20: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Perceptual ThresholdingPerceptual Thresholding

1 KHz

1.5 KHz

1 KHz

3 KHz

50 dB

50 dB

Frequency

LogPower

So: remove coefficients that are below a perceptual threshold to lower unwanted variance.

Page 21: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Feature Extraction, cont.Feature Extraction, cont.

Output 64 floats every half-frame

Downsample to 11.025 KHz mono

Compute Robust ProjectionsCollate Samples

(Repeat)

De-EqualizePerceptual Thresholding

Compute time-frequencymap, take magnitudes

Page 22: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Distortion Discriminant AnalysisDistortion Discriminant Analysis

2048

372ms, step every 186ms

6.1s, step every 186ms

64 64

32

64

64

OPCA

OPCA

Page 23: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Oriented PCAOriented PCA

δ2

δ1

Song A

Distorted Versionof Song A

Song BA

good

pro

ject

ion

dir.

Want δ1 much smaller than δ2

Page 24: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

• The variance of the signal along unit vector is , where is the signal covariance matrix

• The sum of squared differences of (signal-noisy signal) projected along is , where is the correlation matrix of the ‘noise’

• Find that maximize

Oriented PCA, DetailsOriented PCA, Details

n

n

1'n C n 1C

2'n C n2C

1

2

'

'

n C n

n C nn

Page 25: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Distortion Discriminant AnalysisDistortion Discriminant Analysis

2048

372ms, step every 186ms

6.1s, step every 186ms

64 64

32

64

64

OPCA - can trainwith different

noise!

OPCA

DDA is a convolutional neural network with weights trained using OPCA

Page 26: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

RARE Training DataRARE Training Data

• 200 20s audio segments, catted (67 min)• Compute 11 distortions• Perform one layer of OPCA• Subtract mean projections of train data,

norm noise to unit variance• Repeat• Use 10 20s segments as validation set to

compute scale factor for final projections

Page 27: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

The DistortionsThe Distortions

Original

3:1 compress above 30dB

Light Grunge

Distort Bass

“Mackie Mid boost” filter

Old Radio

Phone

RealAudio© Compander

Page 28: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

OutlineOutline

• What is audio fingerprinting?

• Robust Audio Recognition Engine Demo

• Some ideas for applications

• Other companies doing audio fingerprinting

• How does RARE work?

• How well does RARE work?

Page 29: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Test Results – Large Test SetTest Results – Large Test Set

• 500 songs, 36 hours of music• One stored fingerprint computed for every song,

about 10s in, random start point• Test alignment noise, false positive and negative

rates, same with distortions• Define distance between two clips as minimum

distance between fingerprint of first, and all traces of the second

• Approx. 3.5 x108 chances for a false positive

Page 30: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

DDA for Alignment RobustnessDDA for Alignment Robustness

Train second layer with alignment distortion:

0 20 40 60 80 100 120 140 160 1800

0.005

0.01

0.015

0.02

0.025

0.03

0.035

0.04

No Shift TrainingWith Shift Training

Alignment shift / ms

Min

Tar

get

/Tar

get s

q. d

ist

Min

Ta

rge

t/Non

Tar

get

sq.

dis

t

Page 31: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Test Results: PositivesTest Results: Positives

0 50 100 150 200 250 300 350 400 450 5000

0.005

0.01

0.015

0.02

0.025

0.03

372ms steps186ms steps

Fingerprints, sorted by distance

Nor

mal

ized

dis

tanc

e sq

uare

d

Page 32: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Test Results: NegativesTest Results: Negatives

Smallest ‘negative’ score: 0.14, largest ‘positive’ score: 0.026

0 10 20 30 40 50 60 70 80 90 1000.12

0.14

0.16

0.18

0.2

0.22

0.24

0.26

0.28

0.3

Smallest 100 out of 249,500target/non-target values

Nor

mal

ized

dis

tanc

e sq

uare

d

Page 33: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Tests for Other DistortionsTests for Other Distortions

• Take all 500 test clips, grouped 10x50

• Add random alignment shift to all clips

• Distort each group (10 different distortions)

• RESULTS: false positive rate 4.10-6, false negative rate 0.2%

• But that was without using confirmation!

Page 34: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Tests for Other Distortions, Tests for Other Distortions, cont.cont.

0 0.1 0.2 0.3 0.4 0.5 0.6 0.70

10

20

30

40

50

60

70

80

90

Target / NontargetTarget / Target

Weighted Euclidean Distances

Co

un

ts

First 500 out of ¼ million

Page 35: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Lookup: Computational CostLookup: Computational Cost

• Overall system uses < 10% CPU on 850MHz P3 for 240K fingerprints

• Current lookup can handle 160 streams on a quad 2 GHz machine.

• With local caching, server caching, local pruning, and taking advantage of P4 architecture, expect at least a factor of 10 faster: 1,600 streams / machine

Page 36: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

Bitvector yields 30x SpeedupBitvector yields 30x Speedup

X

110011

111000

110000

... ......

X

&

List of FPscompatible

with bin

1st Dimension 2nd Dimension

Searchthese!

Page 37: Chris Burges, John Platt, Jon Goldstein, Erin Renshaw Microsoft Research Name That Tune: Stream Audio Fingerprinting.

ConclusionsConclusions

• RARE gives Microsoft robust, fast, stream audio fingerprinting

• RARE is raring to go…

• For technical details: http://msrweb/~cburges

• Contact aliases: cburges, jplatt