Digital Audio Processing Lab, Dept. of EE, Thursday, June 17th
Data-Adaptive Source Separation for Audio Spatialization
Supervisors: Prof. Preeti Rao and Prof. V. Rajbabu
by Pradeep Gaddipati (08307029)
M.Tech. project presentation


Slide 1

Data-Adaptive Source Separation for Audio Spatialization
Supervisors: Prof. Preeti Rao and Prof. V. Rajbabu
by Pradeep Gaddipati (08307029)
M.Tech. project presentation
Digital Audio Processing Lab, Dept. of EE, Thursday, June 17th

Outline
- Problem statement
- Audio spatialization
- Source separation
- Data-adaptive TFR
- Concentration measure (sparsity)
- Re-construction of signal from TFR
- Performance evaluation
- Data-adaptive TFR for sinusoid detection
- Conclusions and future work

Problem statement
- Spatial audio (surround sound) is commonly used in movies, gaming, etc. to create a suspension of disbelief
- it is applicable when the playback device is located at a considerable distance from the listener

Mobile phones
- headphones are used for playback
- spatial audio is ineffective over headphones: it lacks body-reflection cues, causing in-the-head localization
- the audio cannot be re-recorded, hence the need for audio spatialization

Audio spatialization
- Audio spatialization: a spatial rendering technique for converting the available audio into the desired listening configuration
- Analysis: separating the individual sources
- Re-synthesis: re-creating the desired listener-end configuration

Source separation
- Source separation: obtaining estimates of the underlying sources from a set of observations at the sensors
- Steps: time-frequency transform; source analysis (estimation of mixing parameters); source synthesis (estimation of sources); inverse time-frequency representation
- Input: mixtures (stereo)

[Block diagram: stereo mixtures separated into Source 1, Source 2 and Source 3]

Mixing model
- Anechoic mixing model: mixtures x_i, sources s_j

- Under-determined case (M < N), where M = number of mixtures and N = number of sources

- Mixing parameters: attenuation parameters a_ij and delay parameters

Figure: Anechoic mixing model. Audio is observed at the microphones with differing intensities and arrival times (due to propagation delays) but with no reverberation. Source: P. O'Grady, B. Pearlmutter and S. Rickard, "Survey of sparse and non-sparse methods in source separation," International Journal of Imaging Systems and Technology, 2005.
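The anechoic model above (each mixture is a sum of attenuated, delayed copies of the sources, with no reverberation) can be sketched as follows; the function name and the toy parameter values are illustrative, not taken from the thesis:

```python
import numpy as np

def anechoic_mix(sources, a, d):
    """Mix N sources into M observations: x_i(t) = sum_j a[i,j] * s_j(t - d[i,j]).

    sources: (N, T) array; a: (M, N) attenuations; d: (M, N) integer sample delays.
    """
    M, N = a.shape
    T = sources.shape[1]
    x = np.zeros((M, T))
    for i in range(M):
        for j in range(N):
            dij = int(d[i, j])
            # delayed copy of source j, attenuated by a[i, j]
            x[i, dij:] += a[i, j] * sources[j, :T - dij]
    return x

# toy example: 3 sources, 2 mixtures (under-determined, M < N)
rng = np.random.default_rng(0)
s = rng.standard_normal((3, 1000))
a = np.array([[1.0, 1.0, 1.0],
              [0.8, 0.5, 0.9]])
d = np.array([[0, 0, 0],
              [0, 1, 2]])
x = anechoic_mix(s, a, d)
print(x.shape)  # (2, 1000)
```

With zero delays and unit attenuations (the first row here), a mixture reduces to the plain sum of the sources, which gives a quick sanity check.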

Mixtures

Time-frequency transform
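As a concrete reference point, a minimal fixed-window STFT with a Hamming window (the fixed-window baseline that the data-adaptive TFR later improves on); the window and hop lengths are illustrative:

```python
import numpy as np

def stft(x, win_len=512, hop=128):
    """STFT of a mono signal: windowed frames, one FFT per frame."""
    w = np.hamming(win_len)
    n_frames = 1 + (len(x) - win_len) // hop
    frames = np.stack([x[t * hop : t * hop + win_len] * w
                       for t in range(n_frames)])
    return np.fft.rfft(frames, axis=1)  # shape: (n_frames, win_len // 2 + 1)

fs = 22050  # sampling rate used in the thesis dataset
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
X = stft(x)
print(X.shape)  # (169, 257)
```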

Time-frequency representation of mixtures

Requirement for source separation [1]: W-disjoint orthogonality (WDO)

Source analysis (estimation of mixing parameters)

For every time-frequency bin, estimate the mixing parameters [1]

Create a 2-dimensional histogram of the per-bin estimates; peaks indicate the mixing parameters

Source analysis (estimation of mixing parameters)
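The per-bin estimation and histogram construction can be sketched along the following lines. This is a DUET-style sketch: the symmetric-attenuation and weighting refinements of [1] are omitted, and all names and bin counts are illustrative:

```python
import numpy as np

def duet_histogram(X1, X2, freqs, a_bins=50, d_bins=50):
    """Per-bin mixing-parameter estimates and their 2-D histogram.

    X1, X2: STFTs of the stereo mixture (frames x freq bins);
    freqs: bin frequencies in rad/sample. Histogram peaks locate the sources.
    """
    R = X2 / (X1 + 1e-12)                    # inter-channel ratio at each T-F bin
    alpha = np.abs(R)                        # attenuation estimate per bin
    delta = -np.angle(R) / (freqs + 1e-12)   # delay estimate per bin (samples)
    # weight each bin by its energy so that loud bins dominate the histogram
    w = (np.abs(X1) * np.abs(X2)).ravel()
    H, a_edges, d_edges = np.histogram2d(alpha.ravel(), delta.ravel(),
                                         bins=(a_bins, d_bins), weights=w)
    return H, a_edges, d_edges

# toy check: a single source with attenuation 0.6 and zero delay
rng = np.random.default_rng(1)
X1 = rng.standard_normal((100, 257)) + 1j * rng.standard_normal((100, 257))
X2 = 0.6 * X1
freqs = np.linspace(0, np.pi, 257)
H, a_edges, d_edges = duet_histogram(X1, X2, freqs)
print(H.shape)  # (50, 50)
```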



[Figure: mixture, per-source masks, and the separated Sources 1, 2 and 3]

Source synthesis (estimation of sources)

[Figure: mixture and the estimated Sources 1, 2 and 3]

Source synthesis (estimation of sources)

Source estimation techniques
- degenerate unmixing technique (DUET) [1]
- lq-basis pursuit (LQBP) [2]
- delay and scale subtraction scoring (DASSS) [3]

Source synthesis (DUET)
- Every time-frequency bin of the mixture is assigned to one of the sources, based on a distance measure
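A sketch of this assignment step, using the distance of each bin's observation to each candidate source's (attenuation, delay) model; the exact weighting in [1] differs slightly, and all names here are illustrative:

```python
import numpy as np

def duet_masks(X1, X2, freqs, alphas, deltas):
    """Assign every T-F bin to the nearest source (binary masking).

    alphas/deltas: mixing parameters of the N sources, e.g. taken from the
    histogram peaks. Returns one boolean mask per source.
    """
    N = len(alphas)
    dist = np.empty((N,) + X1.shape)
    for j in range(N):
        # residual of each bin under source j's anechoic model
        num = alphas[j] * np.exp(-1j * freqs * deltas[j]) * X1 - X2
        dist[j] = np.abs(num) ** 2 / (1 + alphas[j] ** 2)
    best = np.argmin(dist, axis=0)          # winning source per bin
    return [best == j for j in range(N)]

# toy check: mixture generated by source 0 (alpha = 0.5, delta = 0)
X1 = np.ones((3, 4), dtype=complex)
X2 = 0.5 * X1
freqs = np.linspace(0, np.pi, 4)
masks = duet_masks(X1, X2, freqs, alphas=[0.5, 2.0], deltas=[0.0, 0.0])
print(masks[0].all(), masks[1].any())  # True False
```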

Source synthesis (LQBP)
- Relaxes the WDO assumption: assumes at most M sources are present at each T-F bin (M = no. of mixtures, N = no. of sources, M < N)
- an lq measure decides which M sources are present

Source synthesis (DASSS)
- Identifies which bins have only one dominant source and uses DUET for those bins
- assumes at most M sources are present in the rest of the bins; an error threshold decides which M sources are present

Inverse time-frequency transform
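For the fixed-window case, the inverse transform can be sketched as weighted overlap-add. This is a generic WOLA sketch, not the thesis implementation; the data-adaptive case additionally needs the COLA-preserving window choices discussed under re-construction:

```python
import numpy as np

def istft(X, win_len=512, hop=128):
    """Inverse STFT: IFFT each frame, apply the synthesis window,
    overlap-add, and divide out the accumulated window energy."""
    frames = np.fft.irfft(X, n=win_len, axis=1)
    w = np.hamming(win_len)
    T = (X.shape[0] - 1) * hop + win_len
    y = np.zeros(T)
    norm = np.zeros(T)
    for t in range(X.shape[0]):
        y[t * hop : t * hop + win_len] += frames[t] * w   # synthesis window
        norm[t * hop : t * hop + win_len] += w ** 2       # window-energy compensation
    return y / np.maximum(norm, 1e-12)
```

Away from the tapered edges, analysis-plus-synthesis windowing with this normalization reconstructs the signal exactly (to floating-point precision).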

[Figure: estimated sources 1, 2 and 3 alongside the original sources 1, 2 and 3, recovered from the stereo mixtures]

Scope for improvement
- Requirement for source separation: W-disjoint orthogonality (WDO) amongst the sources

- The sparser the TFR of the mixtures [4], the less the overlap amongst the sources (i.e. the higher the WDO) and the easier their separation

Data-adaptive TFR
- Music/speech signals contain different components (harmonics/transients/modulations) at different time instants
- the best analysis window differs for different components
- this suggests using a data-dependent, time-varying window function to achieve high sparsity [6]

- To obtain a sparser TFR of the mixture, use different analysis window lengths at different time instants: at each instant, the one that gives maximum sparsity

Data-adaptive TFR
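The per-instant window selection can be sketched as follows: compute candidate spectra at each analysis instant with each window length, and keep the length whose spectrum scores highest under the concentration measure. The kurtosis form used here (normalized fourth moment of the magnitudes) and the helper names are assumptions for illustration:

```python
import numpy as np

def kurtosis(spec):
    """Concentration of the magnitude spectrum: high when few bins carry most energy."""
    m = np.abs(spec)
    return np.mean(m ** 4) / (np.mean(m ** 2) ** 2 + 1e-12)

def best_window(x, center, fs=22050, sizes_ms=(30, 60, 90)):
    """Pick the analysis window length giving the sparsest spectrum at `center`."""
    sizes = [int(fs * s / 1000) for s in sizes_ms]
    scores = []
    for n in sizes:
        lo = max(0, center - n // 2)
        seg = x[lo : lo + n]
        spec = np.fft.rfft(seg * np.hamming(len(seg)))
        scores.append(kurtosis(spec))
    return sizes_ms[int(np.argmax(scores))]

fs = 22050
x = np.sin(2 * np.pi * 440 * np.arange(fs) / fs)
print(best_window(x, center=fs // 2))   # window length (ms) chosen for a steady tone
```

In the thesis the adaptation runs at every 10 ms hop; this helper evaluates a single instant.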

Figure: Data-adaptive time-frequency representation of a singing voice; window function = Hamming, window sizes = 30, 60 and 90 ms, hop size = 10 ms, concentration measure = kurtosis

Sparsity measure (concentration measure)
- What is sparsity? A small number of coefficients contain a large proportion of the energy

Common sparsity measures [5]
- Kurtosis
- Gini Index

Which sparsity measure to use for adaptation? The one that shows the same trend as WDO as a function of analysis window size

WDO and sparsity (some formulae)
W-disjoint orthogonality [4]

Kurtosis

Gini Index
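The two measures can be written down as follows. The Gini form follows the sparsity-measure survey literature (e.g. Hurley and Rickard); the exact normalizations used in the thesis may differ:

```python
import numpy as np

def kurtosis(c):
    """Fourth-moment sparsity measure of spectral magnitudes."""
    m = np.abs(c)
    return np.mean(m ** 4) / (np.mean(m ** 2) ** 2 + 1e-12)

def gini(c):
    """Gini index of spectral magnitudes: 0 for equally spread energy,
    approaching 1 when energy is concentrated in few coefficients."""
    m = np.sort(np.abs(c).ravel())
    K = len(m)
    k = np.arange(1, K + 1)
    return 1 - 2 * np.sum(m / m.sum() * (K - k + 0.5) / K)

# a spectrum with one dominant coefficient is sparser under both measures
peaky = np.array([0.01] * 63 + [1.0])
flat = np.ones(64)
print(gini(peaky) > gini(flat), kurtosis(peaky) > kurtosis(flat))  # True True
```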

Dataset description
- Dataset: BSS oracle
- Sampling frequency: 22050 Hz
- 10 sets each of music and speech signals; one set = 3 signals
- Duration: 11 seconds

WDO and sparsity
WDO vs. window size:
- obtain the TFR of the sources in a set
- obtain source masks based on the magnitudes of the TFRs in each of the T-F bins
- using the source masks and the TFRs of the sources, obtain the WDO measure
NOTE: In the case of the data-adaptive TFR, obtain the TFR of the sources using the window sequence obtained from the adaptation of the mixture

Sparsity vs. window size:
- obtain the TFR of one of the channels of the mixture
- calculate the frame-wise sparsity of the TFR of the mixture

WDO vs. window size

Kurtosis vs. window size

Gini Index vs. window size

WDO and sparsity (observations)
- The highest sparsity (kurtosis/Gini Index) is obtained when the data-adaptive TFR is used

- The highest WDO is obtained by using the data-adaptive TFR (with kurtosis as the adaptation criterion)

- Kurtosis is observed to have a similar trend to that of WDO

Constraint (introduced by source separation)
- The TFR should be invertible

Solution
- Select analysis windows such that they satisfy the constant overlap-add (COLA) criterion [7]

Techniques
- transition window
- modified (extended) window
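The COLA criterion can be checked numerically: sum the shifted analysis windows and verify that the total is constant over the interior. A minimal sketch (function and setup are illustrative); with varying window lengths, as in the data-adaptive TFR, the transition/extended windows are exactly what restores this property:

```python
import numpy as np

def cola_deviation(windows, hops):
    """Max deviation of the overlap-added window sum from a constant,
    measured away from the tapered edges."""
    starts = np.concatenate(([0], np.cumsum(hops))).astype(int)
    T = starts[-1] + len(windows[-1])
    acc = np.zeros(T)
    for s, w in zip(starts, windows):
        acc[s : s + len(w)] += w                      # overlap-add the windows
    core = acc[len(windows[0]) : T - len(windows[-1])]  # ignore edge taper
    return float(np.max(np.abs(core - core.mean())))

# periodic Hann windows at 50% overlap overlap-add to an exact constant
n = np.arange(256)
hann = 0.5 - 0.5 * np.cos(2 * np.pi * n / 256)
wins = [hann] * 20
print(cola_deviation(wins, [128] * 19))  # ~0 (within float rounding)
```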

Inverse data-adaptive TFR

Transition window technique

Modified window technique

Problems with re-construction
Transition window technique:
- adaptation is carried out only on alternate frames
- the WDO obtained amongst the underlying sources is lower

Modified window technique:
- the extended window has larger side-lobes than a normal Hamming window, spreading the signal energy into neighbouring bins
- the WDO measure decreases

Dataset description
- Dataset: BSS oracle

- Mixtures per set: 72 (= 24 x 3)
- attenuation parameters (24 = 4P3 permutations): {10, 30, 60, 80} degrees
- delay parameters: {(0, 0, 0), (0, 1, 2), (0, 2, 1)} samples

A total of 720 (72 x 10) mixtures (test cases) for each of the music and speech groups

Performance (mixing parameters), hop size = 10 ms:

TFR (window size)     | Correct source estimation (%) | Error, attenuation params (degrees) | Error, delay param (samples)
STFT (30 ms)          | 67.99                         | 2.48                                | 0.39
STFT (60 ms)          | 74.79                         | 1.67                                | 0.31
STFT (90 ms)          | 74.93                         | 1.48                                | 0.30
ATFR (30, 60, 90 ms)  | 79.51                         | 0.79                                | 0.25

Performance (source estimation)
- Evaluate the source masks using one of the source estimation techniques (DUET or LQBP)

- Using the set of estimated source masks and the TFRs of the original sources, calculate the WDO measure of each of the source masks

The WDO measure indicates how well the mask preserves the source of interest and suppresses the interfering sources

Performance (source estimation), hop size = 10 ms:

TFR (window size)     | WDO (DUET) | WDO (LQBP)
STFT (30 ms)          | 0.8161     | 0.6218
STFT (60 ms)          | 0.8558     | 0.6350
STFT (90 ms)          | 0.8582     | 0.6356
ATFR (30, 60, 90 ms)  | 0.8612     | 0.6362
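The mask-quality computation can be sketched as below, following the WDO definition of [4]: preserved-target energy minus leaked-interference energy, normalized by the total target energy. The function signature is illustrative:

```python
import numpy as np

def wdo(mask, S_target, S_interf):
    """WDO of a binary T-F mask: how much of the target it preserves,
    minus how much interference it lets through, relative to target energy."""
    preserved = np.sum(np.abs(mask * S_target) ** 2)   # target kept by the mask
    leaked = np.sum(np.abs(mask * S_interf) ** 2)      # interference let through
    return (preserved - leaked) / (np.sum(np.abs(S_target) ** 2) + 1e-12)

# perfectly disjoint sources with an ideal mask give WDO = 1
S1 = np.zeros((4, 4)); S1[0, 0] = 1.0
S2 = np.zeros((4, 4)); S2[1, 1] = 1.0
print(wdo(S1 != 0, S1, S2))  # 1.0 up to the 1e-12 regularizer
```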

Data-adaptive TFR (for sinusoid detection)
Figure: Data-adaptive time-frequency representation of a singing voice; window function = Hamming, window sizes = 20, 40 and 60 ms, hop size = 10 ms, concentration measure = kurtosis, frequency range = 1000 to 3000 Hz

True hits (%), hop size = 10 ms:

TFR (window size)     | 0-1500 Hz | 1000-3000 Hz | 2500-5000 Hz
STFT (20 ms)          | 91.29     | 85.33        | 76.98
STFT (40 ms)          | 95.67     | 82.16        | 68.16
STFT (60 ms)          | 86.78     | 68.24        | 64.95
ATFR (20, 40, 60 ms)  | 96.09     | 86.09        | 82.53

Conclusions
- Mixing model: anechoic
- Kurtosis can be used as the adaptation criterion for the data-adaptive TFR
- The data-adaptive TFR provides a higher WDO measure amongst the underlying sources compared to a fixed-window STFT
- Better estimates of the mixing parameters and the sources are obtained using the data-adaptive TFR
- The performance of DUET is better than that of LQBP

Future work
- Testing of the DASSS source estimation technique

- Re-construction of the signal from the TFR

- A more realistic mixing model, such as an echoic mixing model, needs to be considered to account for reverberation effects

Acknowledgments
I would like to thank Nokia, India for providing financial support and technical inputs for the work reported here.

References
[1] A. Jourjine, S. Rickard and O. Yilmaz, "Blind separation of disjoint orthogonal signals: demixing N sources from 2 mixtures," IEEE Conference on Acoustics, Speech and Signal Processing, 2000.

[2] R. Saab, O. Yilmaz, M. J. McKeown and R. Abugharbieh, "Underdetermined anechoic blind source separation via lq basis pursuit with q