Page 1: Audio Media Processing - sap.ist.i.kyoto-u.ac.jpsap.ist.i.kyoto-u.ac.jp/members/yoshii/lectures/... · Hello! 10 Signal processing + Machine learning + Application . Bayesian

1 Audio Media Processing

Graduate School of Informatics Kyoto University

Kazuyoshi Yoshii [email protected]

2 Audio Media Processing

Statistical machine learning

Bayesian theory, deep neural networks, optimization

Symbol processing

Phrasal, syntactic, topical analysis

Signal processing

Separation, identification, dereverberation

Computational auditory model

3 Keyword: Listen to Sound

• Listen to “speech”
  ▪ Prince-Shotoku robot
    ◆ Simultaneous speech recognition
  ▪ Microphone-array processing
    ◆ Sound source separation and dereverberation
• Listen to “music”
  ▪ Music understanding and performance
    ◆ Sound source separation and music transcription
    ◆ Co-playing and accompaniment
• Listen to “environmental sounds”
  ▪ Object detection in a disaster environment
  ▪ Analysis of frog calling

4 Listen to Speech

5 Shotoku-Taishi (Prince Shotoku)

• Legendary figure said to have understood the simultaneous utterances of ten people

6 Simultaneous Speech Recognition

• Isolated word → Continuous speech
• Evaluation using three robots in a large room

[Figure: systems from 2002, 2003, 2005, and 2006; label: closeness between speakers]

7 Open-Source Software for Robot Audition: HARK

8 Shotoku-Taishi Robot 2012

• Can many-eared robots go beyond humans?
  ▪ 16-channel microphone-array processing
  ▪ Sound directions are given

9 Microphone Array Processing

• Separate mixture signals into an unknown number of sound sources with unknown reverberation time

[Figure: example utterances “How are you?” and “Hello!”]
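As a rough illustration of why several microphones help, the sketch below applies delay-and-sum beamforming, a classic baseline rather than the method presented in these slides. It assumes the per-microphone delays of the target are already known (as when sound directions are given); the signals, delays, and function names are all hypothetical.

```python
import math

def delay_and_sum(channels, delays):
    """Advance each channel by its integer sample delay, then average."""
    n = min(len(ch) - d for ch, d in zip(channels, delays))
    return [
        sum(ch[t + d] for ch, d in zip(channels, delays)) / len(channels)
        for t in range(n)
    ]

def tone(freq, n, shift):
    """Sine wave (freq in cycles per sample) delayed by `shift` samples."""
    return [math.sin(2 * math.pi * freq * (t - shift)) for t in range(n)]

def rms_error(a, b):
    """Root-mean-square difference over the common length of a and b."""
    n = min(len(a), len(b))
    return math.sqrt(sum((a[t] - b[t]) ** 2 for t in range(n)) / n)

# Toy scene (all values hypothetical): 4 microphones, a target source and an
# interfering source that reach the microphones with different sample delays.
n = 400
target_delays = [0, 2, 4, 6]
interferer_delays = [6, 4, 2, 0]
channels = [
    [s + v for s, v in zip(tone(0.01, n, dt), tone(0.12, n, di))]
    for dt, di in zip(target_delays, interferer_delays)
]

clean = tone(0.01, n, 0)
enhanced = delay_and_sum(channels, target_delays)
print("error at one microphone:", round(rms_error(channels[0], clean), 3))
print("error after beamforming:", round(rms_error(enhanced, clean), 3))
```

Aligning on the target adds its copies coherently while the interferer's copies fall out of phase and partly cancel. Unknown source counts and reverberation, the hard part named on this slide, are outside what this baseline handles.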

10 Bayesian Unified Formulation

Signal processing + Machine learning + Application

• A modern, principled approach to the conventional chicken-and-egg problem
  ▪ Cf. errors propagate in a cascade framework (localization → separation → dereverberation)

[Figure: source localization, source separation, and source dereverberation; nonparametric Bayesian model, labeled “Hot topic!”]

11 Localization + Separation + Dereverberation

[Figure: two simultaneous utterances, with dereverberation and without dereverberation]

12 Application to Real Environment

• Separate overlapping utterances in a noisy environment

[Figure: microphone array in the Clock-tower international hall; observed mixture signal and separated signals (utterances, background noise)]

13 Listen to Music

14 Music Co-playing with Humans

• Real-time score-position tracking
  ▪ Listen to the partner’s playing with the robot’s own ears

The robot can follow tempo changes made by the co-player

15 Lyric-to-Audio Synchronization

• Efficient navigation to a section of interest

16 Music Understanding

• Parts-based representation of music
  ▪ Combinations of “pitches” and “timbres”

[Figure: music audio signals as a superposition of compositions (⊗) of pitches (sources) and timbres (filters)]

17 Music Signal Decomposition

• Timbre-based audio source separation

[Figure: observed spectrogram, reconstructed spectrogram, estimated filters (timbres), and timbre weights]
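The idea of explaining an observed spectrogram with a few spectral templates and their time-varying weights can be sketched with plain nonnegative matrix factorization (NMF). This is a generic stand-in, assuming Euclidean-distance multiplicative updates; it is not the source-filter model of these slides, and the toy spectrogram and all names below are hypothetical.

```python
import random

def matmul(a, b):
    """Product of two matrices stored as lists of rows."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    return [list(col) for col in zip(*m)]

def frob_error(x, w, h):
    """Squared Frobenius distance between x and the reconstruction w @ h."""
    r = matmul(w, h)
    return sum((a - b) ** 2 for rx, rr in zip(x, r) for a, b in zip(rx, rr))

def nmf(x, k, iters=200, seed=0, eps=1e-9):
    """Factor x (F x T) into templates w (F x k) and weights h (k x T)."""
    rng = random.Random(seed)
    f, t = len(x), len(x[0])
    w = [[rng.random() + 0.1 for _ in range(k)] for _ in range(f)]
    h = [[rng.random() + 0.1 for _ in range(t)] for _ in range(k)]
    for _ in range(iters):
        # Weights update: H <- H * (W^T X) / (W^T W H)
        wt = transpose(w)
        num, den = matmul(wt, x), matmul(matmul(wt, w), h)
        h = [[h[i][j] * num[i][j] / (den[i][j] + eps) for j in range(t)]
             for i in range(k)]
        # Templates update: W <- W * (X H^T) / (W H H^T)
        ht = transpose(h)
        num, den = matmul(x, ht), matmul(w, matmul(h, ht))
        w = [[w[i][j] * num[i][j] / (den[i][j] + eps) for j in range(k)]
             for i in range(f)]
    return w, h

# Toy "spectrogram": two spectral templates switched on and off over time.
basis_a = [1.0, 0.0, 0.5, 0.0, 0.25, 0.0]
basis_b = [0.0, 1.0, 0.0, 0.5, 0.0, 0.25]
act_a = [1.0, 1.0, 0.0, 0.5, 1.0, 0.0]
act_b = [0.0, 1.0, 1.0, 0.0, 0.5, 1.0]
x = [[basis_a[f] * act_a[t] + basis_b[f] * act_b[t] for t in range(6)]
     for f in range(6)]

w, h = nmf(x, k=2)
print("reconstruction error:", frob_error(x, w, h))
```

Here the columns of `w` play the role of the estimated templates and the rows of `h` the role of their weights over time; the product `w @ h` corresponds to the reconstructed spectrogram.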

18 Replacement of Drum Parts

• Edit only drum parts in mixture signals

19 Replacement of Guitar Solo

• Edit only a guitar part while preserving original timbres

20 Timbre and Effect Estimation

• Preserve the timbres and reverberation of the original guitar solo

[Figure: processing flow. Inputs: audio signal, music score. Step: part separation and effect estimation. Labels: guitar part (phrase), accompanying parts, timbre, reverb., score of new guitar solo, synthesized guitar solo]

21 Songle: Active Music Listening Service

• We can enjoy automatically estimated, visualized, and sonified musical elements of songs on the Web

[Figure: global view and zoom view showing structure, chords, melody, and beats]

songle.jp

22 A New Form of Outreach Activity

• We can amplify user contributions by using machine-learning techniques
  ▪ Corrections by some users → Retraining → Accuracy improvement → Reward to all users

Beat correction

Melody correction

Chord correction

Structure correction

23 Listen to Environmental Sounds

24 Give Ears to Flying/Snake Robots

• Robot audition could help in a disaster

[Newspaper clipping: Nikkei (日経新聞), March 24, 2013]

25 Audition in Flying Robots

• Use a microphone array for localization

26 Visualization of Frog Calling

• Discriminate the calls of two kinds of frogs

27 Visualization of Bird Singing

• Separation and localization in a park

Reiji Suzuki@Nagoya Univ.

28 Robot Audition

29 Why Robot Audition?

Conventional problem: We need to speak close to the microphones

Why? The microphones inevitably pick up noise along with the target utterances

[Figure: overlapping sounds “Foo”, “Bar”, “Hello”, and music (♪)]

Our approach: We aim to separate and recognize sounds

Sound Source Localization (SSL)

Sound Source Separation (SSS)

Computational Auditory Scene Analysis (CASA)

30 Robot Audition for MC Robots

• Robot MCs need to interact with multiple people
  ▪ Use the open-source robot audition software “HARK”, developed by Honda Research Institute Japan and Kyoto University
  ▪ HATTACK 25: Speech-based quiz game

A player’s position can be identified by his or her voice

Inspired by a well-known quiz game in Japan

31 Demo: HATTACK25

• Players can barge in while the robot is reading out a question
  ▪ To answer, players have to say “yes!” first
  ▪ Impossible for standard dialogue systems

Panel Players

32 Robot Audition for Flying Robots

• Localize sound sources on the ground by using flying robots with microphones

[Figure: sound source, microphone array, and sound source map with the estimated sound source location]

Robot audition is disturbed by self-generated noise

1. Learn self-generated noise (with Gaussian Process)

2. Suppress noise from input

Video of the flying robot
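The two steps above, learning the robot's own noise and then removing it from the input, can be sketched in a much simplified form. The sketch assumes a fixed average noise spectrum and plain spectral subtraction in place of the Gaussian-process model named on the slide; the magnitude values and function names are hypothetical.

```python
def learn_noise_profile(noise_frames):
    """Average magnitude per frequency bin over noise-only frames."""
    count = len(noise_frames)
    return [sum(bin_values) / count for bin_values in zip(*noise_frames)]

def suppress(frame, noise_profile, floor=0.05):
    """Subtract the noise estimate from one magnitude frame.

    Each bin is clamped to a small fraction of its input value so that the
    result never becomes negative.
    """
    return [max(m - nz, floor * m) for m, nz in zip(frame, noise_profile)]

# Step 1: magnitude spectra recorded while only the rotors are audible.
noise_frames = [
    [1.0, 0.5, 0.2, 0.1],
    [1.2, 0.4, 0.2, 0.1],
    [0.8, 0.6, 0.2, 0.1],
]
profile = learn_noise_profile(noise_frames)

# Step 2: a frame in which a sound source adds energy to the third bin.
noisy = [1.0, 0.5, 2.2, 0.1]
cleaned = suppress(noisy, profile)
print(cleaned)
```

A fixed profile only works when the noise is stationary; a learned model that predicts the noise from the robot's state is presumably what motivates the Gaussian-process step.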

33 Robot Audition for Rescue Robot

• Estimate robot shapes and detect sound sources in collapsed buildings

How to estimate microphone positions on the robot? Statistical signal processing techniques

[Figure: moving hose-shaped robot (length: 3-8 m, width: 3-5 cm) that moves forward by self-locomotion; time difference of arrival (TDOA) of a “HELP!” call]

We designed a state-space model of the robot posture and estimated the posture by measuring the TDOA of sounds
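A TDOA measurement between two microphones can be illustrated with a cross-correlation peak search. This is a minimal sketch of the measurement only, not of the state-space posture model; the sampling rate, the 5-sample delay, the speed of sound, and the function name are assumptions made for the example.

```python
import random

SPEED_OF_SOUND = 343.0  # m/s, assumed room-temperature value

def estimate_tdoa(x, y, max_lag, fs):
    """Delay of y relative to x, in seconds, from the cross-correlation peak."""
    def xcorr(lag):
        lo, hi = max(0, -lag), min(len(x), len(y) - lag)
        return sum(x[t] * y[t + lag] for t in range(lo, hi))
    best_lag = max(range(-max_lag, max_lag + 1), key=xcorr)
    return best_lag / fs

# Toy signals: the second microphone hears the same noise burst 5 samples later.
fs = 16000
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(200)]
y = [0.0] * 5 + x[:-5]

tdoa = estimate_tdoa(x, y, max_lag=20, fs=fs)
print("TDOA [s]:", tdoa)
print("path-length difference [m]:", tdoa * SPEED_OF_SOUND)
```

Each TDOA constrains the difference in distance from the source to a pair of microphones, which is the kind of observation a posture model can be fitted to.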

34 Posture Estimation for a Hose-shaped Robot

• Accurately estimate shapes even when obstacles exist

[Figure: correct and estimated postures of the robot (3.5 m) next to a wall]

35 Keyword: Listen to Sound

• Listen to “speech”
  ▪ Prince-Shotoku robot
    ◆ Simultaneous speech recognition
  ▪ Microphone-array processing
    ◆ Sound source separation and dereverberation
• Listen to “music”
  ▪ Music understanding and performance
    ◆ Sound source separation and music transcription
    ◆ Co-playing and accompaniment
• Listen to “environmental sounds”
  ▪ Object detection in a disaster environment
  ▪ Analysis of frog calling