Time-Scale Modification of Speech Signals
Bill Floyd
ECE 5525 – Digital Speech Processing
December 14, 2004
Slide 2 of 49
Objectives
- Introduction
- Background Theory
- Methods
- Examples
  - Matlab Code
  - Short Time Fourier Transform
  - Short Time Fourier Transform Magnitude
  - Speech Samples
- Conclusion
- Questions
- References
Slide 3 of 49
Introduction
Goal
- To either speed up or slow down a speech signal while maintaining the approximate pitch
Applications
- Changing voice mail playback speed
- Court stenographers: playing back proceedings more quickly
- Sound effects
- Etc.
Slide 4 of 49
Introduction
Option 1 – Change the sample rate
- If you modify the playback sample rate, you can change the speed, but the pitch also changes
- Increasing the sample rate = higher pitch (chipmunk sound)
- Decreasing the sample rate = lower pitch (drawn-out echo sound)
Option 2 – Decimate or interpolate the signal
- If you change the number of samples and play back at the original sample rate, the result is the same as modifying the sample rate
Slide 5 of 49
Introduction
Option 3 – Use more complex methods
- These change the speed of the signal while preserving the pitch information:
  - Short Time Fourier Transform
  - Short Time Fourier Transform Magnitude
  - Sinusoidal Synthesis
  - Linear Prediction Synthesis
Slide 6 of 49
Terminology
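[Figure: Window Representation plot (horizontal axis in samples, amplitude 0 to 1), annotated with the Window Size and the Frame Rate]
- Window Size: the length of each analysis window, in samples
- Frame Rate: the spacing between the starts of successive windows, in samples (the factor L in the STFT equations)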
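A minimal sketch of how the two terms place the analysis windows, assuming the 200-sample window and 100-sample frame rate used in the Matlab code later in the deck and an arbitrary 1000-sample signal. It uses the common convention of keeping only windows that fit entirely inside the signal (the presentation's own code zero-pads the signal instead).

```matlab
% Sketch: placing analysis windows with a given Window Size and Frame Rate
window_size = 200;                 % samples per window (20 ms at 10 kHz)
frame_rate  = 100;                 % hop between successive window starts
num_samples = 1000;                % assumed signal length for illustration

num_frames = floor((num_samples - window_size)/frame_rate) + 1;
starts = (0:num_frames-1)*frame_rate + 1;      % first sample of each window
stops  = starts + window_size - 1;             % last sample of each window

% Count how many windows cover each sample
coverage = zeros(1, num_samples);
for i = 1:num_frames
    coverage(starts(i):stops(i)) = coverage(starts(i):stops(i)) + 1;
end
% Interior samples are covered by window_size/frame_rate = 2 windows
```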
Slide 7 of 49
Theory
- Short Time Fourier Transform methods: Chapter 7 in our text (Discrete-Time Speech Signal Processing)
- Refer to the notes from class for the mathematical theory of operation
- I will pick up where Dr. Kepuska stopped in his notes
Slide 8 of 49
Short Time Fourier Transform
- Also called the Fairbanks method
- Extract successive short-time segments and discard the ones in between (for a 2X speed-up, every other segment)
[Block diagram: Signal → STFT → Decimate Samples → IFFT → OLA → Output]
Slide 9 of 49
Short Time Fourier Transform
- Frame rate factor L
- After taking the STFT, the frequency-domain representation is X(nL, ω)
- Form a new signal by Y(nL, ω) = X(snL, ω), where s = compression factor (s = 2 plays twice as fast)
- Take the inverse Fourier transform
- Use the overlap-and-add (OLA) method to form the new signal
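The steps above can be sketched in Matlab as follows. This is an illustrative version, not the presentation's code: a 1 s, 100 Hz tone stands in for speech, s = 2, and the triangular window is written out so no toolbox is needed. The tone's 100-sample period equals the frame rate L, so the retained segments happen to line up; with real speech they generally do not, which is the pitch-synchronization problem described next.

```matlab
% Sketch of Y(nL, w) = X(snL, w) followed by the inverse FFT and overlap-add
fs = 10000;                               % samples/sec (assumed)
x = sin(2*pi*100*(0:fs-1)/fs).';          % stand-in for speech: 1 s of a 100 Hz tone
L = 100;                                  % frame rate (hop) in samples
N = 2*L;                                  % window size: frame rate is half the window
s = 2;                                    % compression factor (2 = twice as fast)
w = [(1:2:N-1) (N-1:-2:1)].'/N;           % triangular window; copies at hop L sum to 1
nfft = 1024;

num_in  = floor((length(x) - N)/L) + 1;   % analysis frames X(nL, w)
num_out = floor((num_in - 1)/s) + 1;      % synthesis frames Y(nL, w)
y = zeros((num_out - 1)*L + N, 1);

for n = 0:num_out-1
    m = floor(s*n);                       % source frame index s*n (floor also allows s < 1)
    X = fft(x(m*L + (1:N)) .* w, nfft);   % X(snL, w)
    frame = real(ifft(X));                % back to the time domain
    y(n*L + (1:N)) = y(n*L + (1:N)) + frame(1:N);   % overlap-add at the original hop L
end
```

With s = 2 the 99 analysis frames become 50 synthesis frames, so the output is about half as long while each segment keeps its original waveform, and therefore its pitch.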
Slide 10 of 49
Short Time Fourier Transform
[Figure: windows of X(nL, ω) (top) and Y(nL, ω) = X(2nL, ω) (bottom), horizontal axis in samples]
Slide 11 of 49
Short Time Fourier Transform
[Figure: Window Representation of the Original Windowed Sequence and the New Sequence]
Slide 12 of 49
Short Time Fourier Transform
Problems
- Pitch synchronization: it is highly likely that the pitch periods in adjacent retained segments will not line up properly when they are overlapped and added
Slide 13 of 49
Short Time Fourier Transform Magnitude
- Problems with the STFT method relate directly to the linear phase component of the STFT
- A time shift is a phase change: shifting by n0 samples multiplies the spectrum by e^(-jωn0)
- The alternative approach is to use only the magnitude portion of the STFT: the Short Time Fourier Transform Magnitude (STFTM)
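The "time shift = phase change" point can be checked numerically. The sketch uses a circular shift of a random 64-sample vector (both assumed for illustration) and confirms that the DFT magnitude is unchanged while the DFT is multiplied by the linear phase term e^(-j2πkd/N).

```matlab
% Sketch: a circular time shift changes only the phase of the DFT,
% by the linear amount -2*pi*k*d/N, and leaves the magnitude untouched
N = 64;
d = 5;                                   % shift in samples (assumed)
x = randn(N, 1);
y = circshift(x, d);                     % y(n) = x(n - d), circularly

X = fft(x);
Y = fft(y);
k = (0:N-1).';

mag_diff  = max(abs(abs(Y) - abs(X)));                  % ~0: magnitude unchanged
phase_err = max(abs(Y - X .* exp(-1j*2*pi*k*d/N)));     % ~0: pure linear phase
```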
Slide 14 of 49
Short Time Fourier Transform Magnitude Compression
- With the Fairbanks method, time slices were discarded
- Now we can simply compress the time slices
- Form a new signal by |Y(nM, ω)| = |X(nL, ω)|, where M = L / speed is the new frame rate; e.g., to speed up by two, M = L/2
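For a concrete sense of the numbers, the sketch below works through the frame placement with the values used elsewhere in the deck (L = 100, a 200-sample window, 10,000 samples/sec, speed = 2) and an assumed 400 analysis frames. Re-spacing the frames at M = L/speed roughly halves the duration, and because N/M windows now overlap each output sample instead of N/L, the overlap-add gain grows by the same factor.

```matlab
% Sketch: frame placement for |Y(nM, w)| = |X(nL, w)| with M = L/speed
fs = 10000;                 % samples/sec (as in the presentation's data)
L = 100;                    % analysis frame rate (hop) in samples
N = 200;                    % window size in samples
speed = 2;                  % speed-up factor
M = L/speed;                % synthesis frame rate: 50 samples

F = 400;                    % assumed number of analysis frames
in_len  = (F-1)*L + N;      % samples spanned by the analysis frames (40100)
out_len = (F-1)*M + N;      % samples spanned after re-spacing at hop M (20150)

in_sec  = in_len/fs;        % about 4.0 s
out_sec = out_len/fs;       % about 2.0 s
overlap = N/M;              % windows overlapping each output sample (4 here, was 2)
```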
Slide 15 of 49
Short Time Fourier Transform Magnitude Compression
- Take the inverse Fourier transform
- Use the overlap-and-add method to form the new signal
Slide 16 of 49
Short Time Fourier Transform Magnitude
[Figure: windows of X(nL, ω) (top) and |Y(nM, ω)| = |X(nL, ω)| with M = L/2 (bottom), horizontal axis in samples]
Slide 17 of 49
Short Time Fourier Transform Magnitude
[Figure: Window Representation of the Original Windowed Sequence and the New Sequence]
Slide 18 of 49
Other Methods
- Sinusoidal synthesis (Chapter 9): time-warp the sinewave frequency tracks and amplitude functions
- This technique has been successful not only with speech but also with music, biological, and mechanical signals
- Problems: it does not maintain the original phase relations, and it suffers from reverberance
Slide 19 of 49
Other Methods
- Linear prediction synthesis: use homomorphic and linear prediction results to modify the time base
- The book briefly mentions that this is possible, but I ran out of time before I could investigate this process further
Slide 20 of 49
Other Methods
- New techniques: an Internet search turned up several methods that try to improve on current approaches
- Software: several programs will change the speed for you; Adobe Audition is one of the most all-encompassing right now
Slide 21 of 49
Matlab Code-Prepare the Workspace
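%%%%%%%%%%%%%%%%% Prepare Workspace %%%%%%%%%%%%%%%%
close all; clear all;
window_size_1 = 200;   % window size in samples (20 ms at 10,000 samples/sec)
frame_rate_1 = 100;    % frame rate: hop between successive windows, in samples
% Speed factor (2 = twice as fast, 0.5 = half speed)
speed = 2;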
Slide 22 of 49
Matlab Code-Load the Speech Signal
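%%%%%%%%%%%%%%%%% Load Data File %%%%%%%%%%%%%%%%
filename = input('Please enter the file name to be used: ', 's');   % read the reply as text
[sample_data, sample_rate] = audioread(filename);   % audioread replaces the removed wavread
info = audioinfo(filename);
nbits = info.BitsPerSample;
sample_data = sample_data(:, 1);                     % first channel, as a column vector
num_samples = length(sample_data);
loop_time = floor(num_samples/frame_rate_1);         % number of analysis frames
% Zero-pad after the last real sample so the final window fits
sample_data(num_samples+1:(loop_time-1)*frame_rate_1 + window_size_1) = 0;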
Slide 23 of 49
Matlab Code-Develop the Window
%%%%%%%%%%%%%%%%% Create Windows %%%%%%%%%%%%%%%%
% Assumes the file is sampled at 10,000 samples/sec: 200 samples = 20 ms window, 100 samples = 10 ms frame rate
triangle_30ms = triang(window_size_1);    % alternative: hamming(window_size_1)
W0 = sum(triangle_30ms);                  % window sum, used below to normalize the overlap-add gain
Slide 24 of 49
Matlab Code-Window the Entire Speech Signal
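%%%%%%%%%%%%%%%%% Window the speech %%%%%%%%%%%%%%%%
window_data = zeros(window_size_1, loop_time);        % one windowed frame per column
for i = 0:loop_time-1
    idx = frame_rate_1*i + (1:window_size_1);         % samples covered by frame i
    window_data(:, i+1) = sample_data(idx) .* triangle_30ms;
end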
Slide 25 of 49
Matlab Code-Perform the Fast Fourier Transform
%%%%%%%%%%%%%%%%% Create FFT %%%%%%%%%%%%%%%%
% 1024-point FFT of each windowed frame (fft works down each column)
window_data_fft = fft(window_data, 1024);
Slide 26 of 49
Matlab Code-Recreate the Modified Signal
%%%%%%%%%%%%%%%%% Recreate Original Signal %%%%%%%%%%%%%%%%
% Initialize the recreated signals (row vectors), with room for any speed factor
in_len = (loop_time-1)*frame_rate_1 + window_size_1;               % span of the analysis frames
out_len = ceil((loop_time+1)*frame_rate_1/speed) + window_size_1;  % span of the modified signals
reconstructed_signal = zeros(1, in_len);
real_reconstructed_signal = zeros(1, in_len);
modified_reconstructed_signal = zeros(1, out_len);
modified_reconstructed_signal_compressed = zeros(1, out_len);
Slide 27 of 49
Matlab Code-Recreate the Modified Signal
% Perform the ifft on every frame (column)
recreated_data_ifft = ifft(window_data_fft);                 % full STFT: magnitude and phase
real_recreated_data_ifft = real(ifft(abs(window_data_fft))); % STFTM: magnitude only (zero phase)
% Keep the first window_size_1 samples and scale by frame_rate_1/W0 to normalize the OLA gain
truncated_recreated_data_ifft = real(recreated_data_ifft(1:window_size_1, :)) * (frame_rate_1/W0);
real_truncated_recreated_data_ifft = real_recreated_data_ifft(1:window_size_1, :) * (frame_rate_1/W0);
Slide 28 of 49
Matlab Code-Recreate the Modified Signal
% Get back to the original signal: overlap-add each frame at hop frame_rate_1
% (.' transposes without the complex conjugation that ' applies)
for i = 0:loop_time-1
    idx = frame_rate_1*i + (1:window_size_1);
    reconstructed_signal(idx) = reconstructed_signal(idx) + truncated_recreated_data_ifft(:, i+1).';
    real_reconstructed_signal(idx) = real_reconstructed_signal(idx) + real_truncated_recreated_data_ifft(:, i+1).';
end
Slide 29 of 49
Matlab Code-Recreate the Modified Signal
% Get a modified signal by deleting certain parts (STFT, Fairbanks method):
% output frame i reuses input frame floor(i*speed)+1 and is overlap-added at hop frame_rate_1
for i = 0:floor((loop_time-1)/speed)
    idx = frame_rate_1*i + (1:window_size_1);
    modified_reconstructed_signal(idx) = modified_reconstructed_signal(idx) + real_truncated_recreated_data_ifft(:, floor(i*speed)+1).';
end
Slide 30 of 49
Matlab Code-Recreate the Modified Signal
% Initialize the compressed sequence (STFTM) with the tail of the first frame
% (index ranges assume speed > 1, an integer frame_rate_1/speed, and window_size_1 = 2*frame_rate_1)
hop_new = frame_rate_1/speed;            % new frame rate M = L/speed
modified_reconstructed_signal_compressed(1:frame_rate_1+hop_new+1) = truncated_recreated_data_ifft(frame_rate_1-hop_new:window_size_1, 1).';
% Get a modified signal by compressing: overlap-add every frame at the new frame rate M
for i = 0:(loop_time-2)
    idx = hop_new*i + (1:window_size_1);
    modified_reconstructed_signal_compressed(idx) = modified_reconstructed_signal_compressed(idx) + real_truncated_recreated_data_ifft(:, i+2).';
end
Slide 31 of 49
Matlab Code-Plot Results
%%%%%%%%%%%%%%%%% Plot Results %%%%%%%%%%%%%%%%
figure;
subplot(211); plot(sample_data); title('Original Speech');
v1 = axis; hold on;
subplot(212); plot(real(modified_reconstructed_signal));
title(['STFT Synthesis w/ Speed = ', num2str(speed), 'X']);
v2 = axis;
% Give both panels the axis limits of the longer signal
if speed > 1, v = v1; else, v = v2; end
subplot(211); axis(v);
subplot(212); axis(v);
Slide 32 of 49
Matlab Code-Write Sound Files
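%%%%%%%%%%%%%%%%% Write sound files %%%%%%%%%%%%%%%%
% audiowrite (which replaces wavwrite) expects one column per channel
audiowrite('C:\Classes\ECE_5525\tea party fairbanks 2x.wav', real(modified_reconstructed_signal(:)), sample_rate, 'BitsPerSample', nbits);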
Slide 33 of 49
Examples Baseline Samples
- STFT (sound file)
- STFTM (sound file)
- Original file
- Sample rate 2X
- Sample rate 0.5X
Slide 34 of 49
Examples STFT—Speed 0.5X
[Figure: Original Speech (top) and STFT Synthesis w/ Speed = 0.5X (bottom), amplitude vs. sample index (x 10^4)]
Sound file
Slide 35 of 49
Examples STFT—Speed 2X
[Figure: Original Speech (top) and STFT Synthesis w/ Speed = 2X (bottom), amplitude vs. sample index (x 10^4)]
Sound file
Slide 36 of 49
Examples STFT—Speed 4X
[Figure: Original Speech (top) and STFT Synthesis w/ Speed = 4X (bottom), amplitude vs. sample index (x 10^4)]
Sound file
Slide 37 of 49
Examples STFTM—Speed 0.5X
[Figure: Original Speech (top) and STFTM Synthesis w/ Speed = 0.5X (bottom), amplitude vs. sample index (x 10^4)]
Sound file
Slide 38 of 49
Examples STFTM—Speed 2X
[Figure: Original Speech (top) and STFTM Synthesis w/ Speed = 2X (bottom), amplitude vs. sample index (x 10^4)]
Sound file
Slide 39 of 49
Examples STFTM—Speed 4X
[Figure: Original Speech (top) and STFTM Synthesis w/ Speed = 4X (bottom), amplitude vs. sample index (x 10^4)]
Sound file
Slide 40 of 49
More Results
Change in window size
- If the window becomes too small, a change in pitch will occur
- The window needs to be 2 to 3 pitch periods long
- I generally used 20–30 ms windows
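The window-size rule of thumb is simple arithmetic. Assuming the 10,000 samples/sec rate from the code and a 100 Hz pitch (typical of an adult male voice), 2 to 3 pitch periods work out to the 20 to 30 ms range quoted above:

```matlab
% Sketch: convert "2 to 3 pitch periods" into a window length
% Assumed values: 10,000 samples/sec and a 100 Hz fundamental frequency
fs = 10000;
f0 = 100;                          % assumed pitch in Hz
T0 = 1/f0;                         % pitch period: 10 ms
win_ms = 1000*[2 3]*T0;            % 2-3 pitch periods: 20-30 ms
win_samples = round(fs*[2 3]*T0);  % 200-300 samples
```

A higher-pitched voice has a shorter period, so the same 20 to 30 ms window then spans more pitch periods.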
Slide 41 of 49
More Results
Change in frame rate
- If the frame rate (the spacing between windows) decreases too much, too many segments overlap to produce an intelligible signal
Slide 42 of 49
More Results
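Change window type
- Tried Hamming: not much perceptual difference
- Normalizing by the window sum W0 becomes important here, because Frame Rate/W0 is no longer equal to one (it equals one for the 200-sample triangular window with a 100-sample frame rate)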
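The frame_rate_1/W0 scaling in the Matlab code can be checked directly. The sketch below assumes the deck's 200-sample window and 100-sample frame rate and writes both windows out with the formulas Matlab's triang and hamming use for an even length: the triangular window sums to exactly 100, so copies spaced 100 samples apart overlap-add to a flat 1, while the Hamming window sums to 107.54, so the same overlap-add runs about 7.5% high unless it is divided by W0.

```matlab
% Sketch: window sum W0 and the overlap-add (OLA) gain at hop L.
% The windows use the same formulas as triang() and hamming() for even N,
% so no toolbox is needed.
N = 200;                                   % window_size_1
L = 100;                                   % frame_rate_1
n = (0:N-1).';
tri = [(1:2:N-1) (N-1:-2:1)].'/N;          % triangular window
ham = 0.54 - 0.46*cos(2*pi*n/(N-1));       % Hamming window

W0_tri = sum(tri);                         % 100, so frame rate/W0 = 1
W0_ham = sum(ham);                         % 107.54, so frame rate/W0 is about 0.93

% Overlap-add ten copies at hop L and inspect the region covered by two windows
num = 10;
ola_tri = zeros((num-1)*L + N, 1);
ola_ham = ola_tri;
for i = 0:num-1
    idx = i*L + (1:N);
    ola_tri(idx) = ola_tri(idx) + tri;
    ola_ham(idx) = ola_ham(idx) + ham;
end
mid = L+1:(num-1)*L;
gain_tri = mean(ola_tri(mid));             % flat at 1
gain_ham = mean(ola_ham(mid));             % averages W0_ham/L, about 1.075
```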
Slide 43 of 49
Conclusion
Optimum area
- Frame rate is one half of the window size
- Window size needs to be 2 to 3 pitch periods long
- It is possible to change the time scale easily and still maintain the original pitch, although the result is not always natural sounding
Slide 44 of 49
Conclusion
Further investigation
- What to do when you want to slow down to less than half speed? With the STFTM, the new frame rate M = L/speed then exceeds the window size, so there will be gaps between the sequences
Slide 45 of 49
Conclusion
Further investigation
- What to do when you want to slow down to less than half speed?
- Could replicate windowed segments
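The gap problem and the replication idea can be quantified with a short sketch. It assumes L = 100, a 200-sample window, a quarter-speed target and 10 analysis frames (all illustrative): spacing STFTM frames at M = L/speed leaves 200 uncovered samples between frames, whereas one possible replication scheme, reusing each frame R = 1/speed times at the original hop, keeps the windows overlapping while stretching the output by roughly 1/speed. This replication scheme is a hypothetical illustration, not a method taken from the presentation's references.

```matlab
% Sketch: slowing down to a quarter speed with a window of N = 2L samples
L = 100;                           % frame rate (hop) in samples
N = 2*L;                           % window size
speed = 0.25;                      % assumed slow-down factor
M = L/speed;                       % STFTM spacing: 400 samples
gap = max(M - N, 0);               % 200 uncovered samples between frames

% Replication (hypothetical scheme): keep hop L and use each analysis frame
% R times in a row, so consecutive output windows still overlap
R = round(1/speed);                % 4 copies per frame
F = 10;                            % assumed number of analysis frames
src = kron(0:F-1, ones(1, R));     % source frame for each output frame: 0 0 0 0 1 1 1 1 ...
out_len = (numel(src) - 1)*L + N;  % 4100 samples, versus (F-1)*L + N = 1100 for the input
```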
Slide 46 of 49
Conclusion
Further investigation
- Use the other methods to determine quality
  - Implement sinusoidal synthesis
  - Implement linear predictive synthesis using linear prediction and homomorphic methods
- Work on synchronizing pitch periods
  - Shift samples so that the peaks line up
  - Scott and Gerber: Synchronized Overlap and Add (SOLA)
    - Cross-correlate two segments to find the peak
    - Use the peaks to line up the samples
  - Align the window at the same relative location within a pitch period
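A minimal sketch of the cross-correlation alignment step listed above: the next segment is slid over the tail of the output within a search range, the lag with the largest normalized cross-correlation is chosen, and the segment is cross-faded in at that lag. The periodic test signal (an 80-sample "pitch period"), the segment length, the nominal position and the ±60-sample search range are all assumptions for illustration, and the linear cross-fade is one common choice rather than the specific procedure of Scott and Gerber.

```matlab
% Sketch of the cross-correlation alignment step on a synthetic periodic signal
period = 80;                                  % samples per "pitch period" (assumed)
n = (0:999).';
x = sin(2*pi*n/period) + 0.5*sin(4*pi*n/period);
N = 200;                                      % segment length
Ss = 100;                                     % nominal position of the new segment
prev = x(1:Ss+N);                             % output built so far
seg = x(150 + (1:N));                         % next segment taken from the input
max_lag = 60;                                 % search range in samples

best_r = -Inf; best_k = 0;
for k = -max_lag:max_lag
    a = prev(Ss + k + (1:N/2));               % existing output under the overlap
    b = seg(1:N/2);                           % start of the new segment
    r = sum(a.*b) / sqrt(sum(a.^2)*sum(b.^2));  % normalized cross-correlation
    if r > best_r
        best_r = r; best_k = k;
    end
end

% Overlap-add at the best lag with a linear cross-fade
pos = Ss + best_k;
fade = linspace(1, 0, N/2).';
y = prev(1:pos + N/2);
y(pos + (1:N/2)) = y(pos + (1:N/2)).*fade + seg(1:N/2).*(1 - fade);
y = [y; seg(N/2+1:N)];
```

The chosen lag is a whole number of pitch periods away from the misaligned position, so the cross-fade joins two in-phase waveforms instead of cancelling them.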
Slide 47 of 49
Questions
Are there any questions?
Slide 48 of 49
References
Quatieri, T.F. Discrete-Time Speech Signal Processing. Prentice Hall, Upper Saddle River, NJ, 2002.
Rabiner, L.R. and Schafer, R.W. Digital Processing of Speech Signals. Prentice Hall, Englewood Cliffs, NJ, 1978.
Oppenheim, A.V. and Schafer, R.W. Digital Signal Processing. Prentice Hall, Englewood Cliffs, NJ, 1975.
Scott, R. and Gerber, S. “Pitch Synchronous Time-Compression of Speech,” Proc. Conf. Speech Communications Processing, pp. 63-85, April 1972.
Slide 49 of 49
References
Fairbanks, G., Everitt, W.L., and Jaeger, R.P. “Method for Time or Frequency Compression-Expansion of Speech,” IRE Transactions on Audio, vol. AU-2, pp. 7-12, Jan. 1954.