General Guidelines - my.fit.edu/~vkepuska/FIT/2016/KepuskaDossierDraft4c.d…




VI. Appendixes

A. Final Exam Web-Based Tool

In the example included here, I use my name (“Veton Këpuska” and “Veton Këpuska 1”) to log in for the final exam under ECE3551 at two different times: 9:00 am, which was marked in red (accepted), and 1:30 pm, which was marked in blue (rejected), as indicated below.

This tool allows me to print the table of final examination times of all registered users:


This tool is password protected, and I never had any problems utilizing it. One issue that I have not been able to resolve is integrating this tool with our PAWS system, due to FIT policies. This integration would allow me to enter official grades directly into the system rather than transferring data to yet another tool.


B. Final Project Requirements

General Guidelines

The project should exercise most if not all of the ideas covered in the lectures and/or labs. Students are encouraged to go beyond the scope of the material covered in lectures and laboratory exercises. The projects will be evaluated based on the following criteria:

- Understanding of the problem
- Creativity
- Quality of implementation and execution
- Independent Thinking
- Originality
- Thoroughness of the work
- Clarity of idea(s)
- Presentation
- Documentation

Deliverables

The deliverables at a minimum should contain:

1. PowerPoint Presentation outlining the basic idea of the project and its implementation
2. Microsoft Word Document with a detailed description of the project including:

- Problem Statement
- Literature Review of various solutions
- Detailed explanation of the selected solution and its implementation
- Results
- Concluding Remarks
- Bibliography & References

3. Complete Source Code Project including:

- Source Code Files
- Development System Project Files

The completed project should be delivered in electronic form, preferably as a zip file or on a CD.

Team vs. Individual Effort

Team efforts are allowed but not encouraged. Explicit permission for team work must be obtained from the instructor. Note that a team project must represent a significantly larger effort to justify the team work; in other words, individual efforts are graded differently from team efforts. The expected work is directly proportional to the size of the team: the larger the team, the more will be expected of it.


If more than one member participates in the effort, the following guidelines should be strictly observed:

1. The team should have a leader.
2. The contribution of each individual in the team must be made clear and explicit.
3. Individual contributions must be presented up front, as a percentage of the total effort for each team member.
4. Each team member should present his/her own work.
5. Academic Integrity and Honesty should be strictly observed.

Failure to observe the Academic Integrity and Honesty policy will automatically trigger an “F” grade.

Note

Failure to observe any of the guidelines may result in an incomplete grade.

~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
Dr. Veton Këpuska, Associate Professor
ECE Department
Florida Institute of Technology
Olin Engineering Building
150 West University Blvd.
Melbourne, FL 32901-6975
Tel. (321) 674-7183
E-mail: [email protected]
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~


C. Speech Analysis and Special Effects Laboratory: SASE_Lab() Tool

An empty opening screenshot of my SASE_Lab() tool.

The next screenshot displays the process of loading a wave file from the TIMIT (Texas Instruments and MIT) corpus. This is done interactively by picking the file from the directory where the corpus is located.

After selecting a region of the loaded waveform by clicking on it, the rest of the windows are populated by performing various analyses, as described in detail below.


One additional feature that was added after the final report was published is presented next. It depicts a sentence with its individual TIMIT phonetic transcription with boundaries, the individual words, and the class of each phoneme.


The next figure depicts the options provided by the View menu, which offers two choices: “Process Speech” and “Spectrogram”. The default option is to display “Process Speech”, as depicted in the figure below.

If I pick the Spectrogram display, I get the following window, which depicts the spectrographic analysis performed on the original waveform:


Notice the task-bar below the original figure. This task-bar is designed to let the user focus not on the overall utterance but on 1-second-long sections, as demonstrated below:

By design, the next feature of this tool allows the user to apply various special effects from a list (Echo, Reverberation, Flanger, Chorus, Vibrato, Tremolo, and Voice Changer), as depicted in the next slide, where “Reverberation” was picked and applied to the original signal:


Documentation of SASE_Lab Project:

Introduction

Speech processing is the application of signal processing techniques to a speech signal for a variety of applications. A common goal across applications is to represent the characteristics of the original signal as efficiently as possible. Example applications include speech recognition, speaker recognition, speech coding, speech synthesis, and speech enhancement. Efficiency is a key factor in all of these applications. These applications are covered in more detail in the motivation section of this report.

In speech recognition, the goal is to provide a sound wave containing speech to a system, and have the system recognize parts of words or whole words. To complete this goal, speech processing must first take place to prepare the speech signal for recognition. This is done by breaking down the speech signal into frames of approximately 10-30 milliseconds, and generating a feature vector that accurately and efficiently characterizes the speech signal for that frame of time. Reducing the frame from a fairly large set of speech signal samples to a much smaller set of data, the feature vector, allows for quicker computation during the recognition stage of the system.

The figure above shows a high level diagram of a typical speech recognition system. The input to the system is a speech signal and is passed to the front end of the system. The front end is responsible for the speech processing step and will extract features of the incoming speech signal. The back end of the system will use the features provided by the front end and based on statistical models, provide recognized speech. This paper will focus on the front end and related speech processing required to extract feature vectors from a speech signal.


[Figure: Speech Signal → Front End (Feature Extraction) → Back End (Recognition) → Recognized Speech]


Motivation

Speech processing is an important technology that is used widely by many people on a day-to-day basis. Almost all smartphones come equipped with speech recognition capabilities to enable hands-free use of certain phone functionality. A recent advancement of this mobile technology is Siri on the Apple iPhone 4S. This application takes speech recognition a step further and adds machine understanding of the user’s requests. Users can ask Siri to send messages, make schedules, place phone calls, etc. Siri responds to these requests in a human-like manner, making the interaction seem almost like talking to a personal assistant. At the base of this technology, speech processing is required to extract information from the speech signal to allow for recognition and, beyond that, understanding.

Speech recognition is also commonly used in interactive voice response (IVR) systems. These systems are used to handle large call volumes in areas such as banking and credit card services. IVR systems allow interaction between the caller and the company’s computer systems directly by voice. This allows for a large reduction in operating costs, as a human phone operator is not necessary to handle simple requests by a customer. Another benefit of an IVR system is to segment calls to a large company based on the caller’s needs and route them to appropriate departments.

Other applications of speech processing and recognition focus on a hands-free interface to computers. These types of applications include voice transcription or dictation systems, which can be found commercially in use for direct speech-to-text transcription of documents. Other hands-free interfaces allow for safer interaction between humans and machines, such as the OnStar system used in Chevy, Buick, GMC, and Cadillac vehicles. This system allows users to use their voice to control navigation instructions, vehicle diagnostics, and phone conversations. Ford vehicles use a similar system called Sync, which relies on speech recognition for a hands-free interface to calling, navigation, in-vehicle entertainment, and climate control. These systems’ use of a hands-free interface to computing allows for safer interaction when the user’s attention needs to be focused on the task at hand: driving.

Another growing area of technology utilizing speech processing is the video game market. Microsoft released the Kinect for the Xbox 360, which is an add-on accessory to the video game console to allow for gesture/voice control of the system. While the primary focus of the device was gesture control, it uses speech processing technology to allow control of the console by the user’s voice.


Discrete-Time Signals

A good understanding of discrete-time signals is required prior to discussing the mathematical operations of speech processing. Computers are discrete systems with finite resources such as memory. Sound is stored as discrete-time signals in digital systems. The discrete-time signal is a sequence of numbers that represent the amplitude of the original sound before being converted to a digital signal.

Sound travels through the air as a continuously varying pressure wave. A microphone converts an acoustic pressure signal into an electrical signal. This analog electrical signal is a continuous-time signal that needs to be discretized into a sequence of samples representing the analog waveform. This is accomplished through the process of analog-to-digital conversion. The signal is sampled at a rate called the sampling frequency (Fs), or sampling rate. This number determines how many samples per second are used during the conversion process. The samples are evenly spaced in time, and each represents the amplitude of the signal at that particular time.

The above figure shows the difference between a continuous-time signal and a discrete-time signal. On the left, one period of a 200 Hz sine wave is shown. The period of this signal is the reciprocal of the frequency, in this case five milliseconds. On the right, the signal is shown in discrete-time representation. The signal is a sequence of samples, each sample representing the amplitude of the signal at a discrete time. The sampling frequency for this example was 8 kHz, meaning 8000 samples per second. The result is that one period of the 200 Hz sine wave spans 40 samples.

The sampling frequency is directly related to the accuracy of representation of the original signal. By decreasing the sampling rate to 2 kHz, or 2000 samples per second, the discrete-time signal loses accuracy. This can be seen on the right side of the following figure.

The exact opposite is also true: by increasing the sampling frequency, the signal can be represented more accurately as a discrete-time signal. The following figure uses a sampling frequency of 44.1 kHz. It can be seen that the signal on the right more accurately describes the continuous-time signal at a sampling rate of 44.1 kHz as opposed to 2 kHz or 8 kHz.
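The relationship between sampling rate and samples per period can be sketched in a few lines of Python (a minimal illustration using NumPy; the function name `sample_sine` is my own, not from the report):

```python
import numpy as np

def sample_sine(freq_hz, fs_hz, duration_s):
    """Sample a sine wave of the given frequency at sampling rate fs_hz."""
    n = np.arange(int(fs_hz * duration_s))          # evenly spaced sample indices
    return np.sin(2 * np.pi * freq_hz * n / fs_hz)  # amplitude at each sample time

# One period of a 200 Hz sine lasts 1/200 s = 5 ms.
period_s = 1 / 200

# Number of samples covering that one period at each sampling rate:
for fs in (2000, 8000, 44100):
    print(fs, len(sample_sine(200, fs, period_s)))
```

At 8 kHz this yields the 40 samples per period mentioned above; at 2 kHz only 10 samples remain, which is why the lower-rate representation looks so much coarser.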


Common sampling rates used currently for digital media are as follows:

- 8 kHz – Standard land-line telephone service
- 16 kHz – Wideband land-line telephone service
- 44.1 kHz – CD Audio Tracks
- 96 kHz – DVD Audio Tracks

Speech Processing

The main goal of speech processing is to reduce the amount of data used to characterize the speech signal while maintaining an accurate representation of the original data. This process typically produces a feature vector of 13 numbers. The feature vector is commonly referred to as the Mel-Frequency Cepstral Coefficients (MFCCs). The process of feature extraction can be broken down into several stages of mathematical operations that take place on a discrete-time signal input. The following is a high-level diagram of the feature extraction stages.

[Figure: Speech Signal → Framing → Pre-Emphasis → Windowing → Fourier Transform → Mel Filter → Log → Discrete Cosine Transform → Feature Vector]


Framing

The speech signal can be of any length, but for analysis the signal must be divided into segments. Each segment, or frame, will be analyzed and a feature vector produced. Speech signals are typically stationary over a period of 10-30 milliseconds. Given a sampling frequency of 8 kHz, the corresponding frame sizes range from 80 to 256 samples. The samples contained in the frame are passed through all stages of the front end to produce a vector containing 13 values that characterize the speech signal during that frame.

Upon complete processing of a particular frame, the next frame should not begin where the previous one ended. To more accurately process the signal, the next frame should overlap the previous frame by some amount.
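The overlapping framing described above can be sketched as follows. This is a minimal NumPy illustration; `frame_signal` is a hypothetical helper, and the 256-sample frame with a 128-sample hop (50% overlap) is one common choice, not the only one:

```python
import numpy as np

def frame_signal(x, frame_len=256, hop=128):
    """Split signal x into overlapping frames of frame_len samples.

    hop is the step between frame starts; hop = frame_len // 2
    gives 50% overlap between consecutive frames."""
    n_frames = 1 + (len(x) - frame_len) // hop
    return np.stack([x[i * hop : i * hop + frame_len] for i in range(n_frames)])

x = np.arange(768.0)        # stand-in for 768 speech samples
frames = frame_signal(x)    # 768 samples -> 5 overlapping frames of 256
```

With these parameters, 768 samples divide into exactly five frames, matching the five analysis windows in the figure discussed below.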




The above figure shows 768 samples of a speech signal and also the overlapping nature of the speech frames. The blue signal is the speech signal, and it can be noted that this is a semi-stationary section of speech. The periodicity of the signal is clearly shown. The other signals show how this speech would be divided into five frames. Each colored curve is an analysis window that segments the speech signal into frames. Each frame is 256 samples in length, and each frame overlaps the previous by 50%, or 128 samples in this case. This ensures accurate processing of the speech signal. A front end system can be described by its frame rate, or the number of frames per second that the speech signal is divided into. The frame rate of the front end also translates into the number of feature vectors produced per second, due to the fact that one frame produces one feature vector.

Pre-emphasis

The next stage of the front end is to apply a pre-emphasis filter to the speech frame that has been segmented in the previous step. A pre-emphasis filter in relation to speech processing is typically a high-pass, 1st order, finite impulse response (FIR) filter. A filter modifies a signal that is passed through it.

[Figure: Input → Filter → Output]

A filter has a characteristic called frequency response. This describes how the filter modifies the signal passed through it. The filter used here is high-pass, meaning that it will pass the frequencies above the cut-off frequency while attenuating, or reducing, the parts of the signal below the cut-off frequency. The frequency response of a common 1st-order pre-emphasis filter is shown below.

The above graph was generated using the value -0.97 for the filter coefficient and a sampling rate of 8 kHz. The frequency response of this filter shows that the magnitude of lower frequencies is attenuated, or reduced. The opposite is also true: higher frequencies are not attenuated as much as frequencies in the lower parts of the spectrum.

The reason for applying the pre-emphasis filter is tied to the characteristics of the human vocal tract. There is a roll-off in spectral energy towards higher frequencies in human speech production. To compensate for this factor, lower frequencies are reduced. This prevents the spectrum from being overpowered by the higher energy present in the lower part of the spectrum.


The above figure shows the original speech signal in blue and the pre-emphasized speech signal in red. While maintaining the same overall periodicity and general waveform shape, the high frequency components are accentuated. The quicker changing parts of the signal, higher frequencies, are compensated so that the lower frequency energy does not overpower the spectral results.

The operation of applying a filter to a signal is represented mathematically through the convolution operation, denoted by the ‘*’ operator.

y(t) = x(t) * h(t)

The above equation convolves the input signal x(t) with filter h(t) to produce the output y(t). In a continuous time signal, the convolution operation is defined by the following integration:

y(t) = ∫_{−∞}^{t} h(τ) x(t−τ) dτ

In a discrete-time system the convolution operation changes from integration to summation.

y[n] = Σ_{i=0}^{N} βi x[n−i]

N − filter order


βi − filter coefficients
x[n] − input signal
y[n] − output signal

In the case of our pre-emphasis filter, the order is one. This means there will be two coefficients, β0 and β1. The first coefficient of this FIR filter, β0, is one. The coefficient β1 used in the above example frequency response was the value -0.97. Expanding the above summation with these coefficient values yields the following results.

y[n] = β0 x[n] + β1 x[n−1]

y[n] = x[n] − 0.97 x[n−1]

The input to this function is the sequence of samples in the speech frame, x[n]. The output of this filter is the pre-emphasized (high-pass filtered) frame of speech, y[n].
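The difference equation above translates almost directly into code. A minimal sketch using NumPy (the function name `pre_emphasis` is illustrative; x[−1] is taken as zero so the first output sample equals the first input sample):

```python
import numpy as np

def pre_emphasis(x, coeff=0.97):
    """Apply the 1st-order FIR pre-emphasis filter y[n] = x[n] - coeff * x[n-1]."""
    y = np.empty_like(x)
    y[0] = x[0]                     # x[-1] assumed 0 for the first sample
    y[1:] = x[1:] - coeff * x[:-1]  # vectorized difference equation
    return y
```

For a slowly varying (low-frequency) input such as a constant signal, consecutive samples are nearly equal, so the output is small; rapid (high-frequency) changes pass through largely intact, which is exactly the high-pass behavior described above.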

Windowing

After a frame of speech has been pre-emphasized, a window function must be applied to the speech frame. While many different types of windowing functions exist, a Hamming window is typically used for speech processing. The figure below shows a Hamming window of length 256.

Hamming window function of length N:

w(n) = 0.54 − 0.46 cos(2πn / (N−1))


To apply this window function to the speech signal, each speech sample is multiplied by the corresponding value of the window function to generate the windowed speech frame. At the center of the Hamming window the amplitude is 1.0, and it decays to the value 0.08 at either the beginning or end of the window. This allows the center of the speech frame to remain relatively unmodified by the window function, while samples are attenuated more the further they are from the center of the speech frame. Observe the following figure. On the top half, a speech signal is shown in blue, with a Hamming window function in red. The bottom half of the figure shows the result when the Hamming window is applied to the speech signal. At the center of the frame the speech signal retains nearly its original values, but the signal approaches zero at the edges.

This process of windowing is very important to speech processing in the next stage, the Fourier Transform. If a windowing function is not applied to a speech frame, there can be large discontinuities at the edges of the frame. These discontinuities will cause problems with the Fourier Transform and will induce errors in the frequency spectrum of the framed audio signal. While it may seem like information is being lost at the edges of the speech frame due to the reduction in amplitude, the overlapping nature of sequential speech frames ensures all parts of the signal are analyzed.
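The window formula above can be checked numerically. A short sketch (NumPy assumed) evaluating the length-256 Hamming window and confirming the endpoint and center values quoted above:

```python
import numpy as np

N = 256
n = np.arange(N)
w = 0.54 - 0.46 * np.cos(2 * np.pi * n / (N - 1))  # Hamming window

# Endpoints are attenuated to 0.54 - 0.46 = 0.08; the center is close to 1.0.
# Applying the window to a frame is an element-wise product:
#   windowed_frame = w * frame
```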

Fourier Transform

The Fourier Transform is an algorithm used to transform a time-domain signal into the frequency domain. While the time domain gives information about how the signal’s amplitude changes over time, the frequency domain shows the signal’s energy content at different frequencies. See the following graph for an example frequency spectrum of a time-domain signal. The x-axis is frequency and the y-axis is the magnitude of the signal. It can be observed that this particular frequency spectrum shows a concentration of energy below 1 kHz, and another peak of energy between 2.5 kHz and 3.5 kHz.

The human ear interprets sound based on its frequency content. Speech signals contain different frequency content based on the sound being produced. Speech processing systems analyze the frequency content of a signal to recognize speech. Every frame of speech passed through the speech processing system has the Fourier Transform applied to allow analysis in the frequency domain.

The above graph shows peaks of frequency magnitude in three areas. Most speech sounds are characterized by three frequencies called formant frequencies. The formants for a particular sound are resonant frequencies of the vocal tract during that sound and contain the majority of signal energy. Analysis of formant locations in terms of frequency is the basis for recognizing particular sounds in speech.

The Fourier Transform f̂(ξ) of a continuous-time signal f(x) is defined as:


f̂(ξ) = ∫_{−∞}^{∞} f(x) · e^{−2πixξ} dx,  for every real number ξ

The Inverse Fourier Transform of f̂(ξ) reproduces the original signal f(x):

f(x) = ∫_{−∞}^{∞} f̂(ξ) · e^{2πixξ} dξ,  for every real number x

f̂(ξ) − continuous frequency spectrum
f(x) − continuous-time input signal

Speech processing typically deals with discrete-time signals, and the corresponding discrete Fourier Transforms are given below:

X[k] = Σ_{n=0}^{N−1} x[n] · e^{−i2πkn/N}

x[n] = (1/N) Σ_{k=0}^{N−1} X[k] · e^{i2πkn/N}

x[n] − discrete-time signal
X[k] − discrete frequency spectrum
N − Fourier Transform size
n − sample number
k − frequency bin number

The vector X[k] contains the output values of the Fourier Transform algorithm. These values are the frequency-domain representation of the input time-domain signal, x[n]. For each index k, from 0 to N−1, the magnitude of the value is the magnitude of signal energy at frequency bin k. When analyzing magnitude, the Fourier Transform returns results that are symmetric across the mid-point of the FFT size. For example, if the FFT size is 1024, the first 512 results will be symmetric to the last 512 values in terms of magnitude. See the following graph of a 1024-point Fourier Transform.

Due to this symmetry, only the first half of the Fourier Transform is used when analyzing the magnitude of the frequency content of a signal. The relation from bin k to actual frequency depends on the sampling rate (Fs) of the system. The first frequency bin, k = 0, represents 0 Hz, or the overall energy of the signal. The last frequency bin, k = N/2 (512 in the above graph), represents the maximum frequency that can be detected based on the sampling rate of the system. The Nyquist-Shannon Sampling Theorem states that the maximum frequency that can be detected in a discrete-time system is half the sampling rate. Given a sampling rate of 8 kHz, the maximum frequency, or Nyquist frequency, would be 4 kHz. This value corresponds to the 512th index of the above graph.
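The symmetry and the bin-to-frequency relation can be demonstrated with a short NumPy sketch. The 1 kHz test tone and the variable names are my own choices for illustration, not from the report:

```python
import numpy as np

fs = 8000                                # sampling rate: 8 kHz
N = 1024                                 # FFT size
n = np.arange(N)
x = np.sin(2 * np.pi * 1000 * n / fs)    # 1 kHz test tone

mag = np.abs(np.fft.fft(x))              # magnitude spectrum

# Magnitude is symmetric about the midpoint: bin k mirrors bin N - k,
# so only bins 0 .. N/2 are kept; bin N/2 is the Nyquist frequency,
# fs / 2 = 4 kHz here. The 1 kHz tone peaks at bin k = 1000 * N / fs = 128.
peak_bin = int(np.argmax(mag[: N // 2 + 1]))
```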

Each frequency bin k represents a range of frequencies rather than a single value. The range covered by a set of frequencies is called a bandwidth (BW). The bandwidth is defined by the initial and final frequencies in the given band; for example, if a frequency bin started at 200 Hz and ended at 300 Hz, the bandwidth of the bin would be 100 Hz. The Fourier Transform returns frequency bins that are equally spaced from 0 Hz to the Nyquist frequency, with each bin having the same bandwidth. To compute the bandwidth of each bin, the overall bandwidth of the signal must be divided by the number of frequency bins. For example, given Fs = 8 kHz and N = 1024:

BW_bin = (Fs/2) / (N/2) = (8 kHz / 2) / (1024 / 2) = 7.8125 Hz bandwidth per bin

This is the bandwidth (BW_bin) of each frequency bin in the results of the Fourier Transform. To translate from frequency bin k to actual frequency, the bin number (k) is multiplied by the bin bandwidth (BW_bin). For example, if BW_bin = 7.8125 Hz and k = 256:

f_max_bin = BW_bin · k = 7.8125 Hz · 256 = 2000 Hz

f_min_bin = f_max_bin − BW_bin = 2000 Hz − 7.8125 Hz = 1992.1875 Hz

These calculations show that frequency bin 256 covers frequencies from about 1.992 kHz to 2 kHz. Note that the bin bandwidth is inversely proportional to the Fourier Transform size: as the Fourier Transform size increases, the bin bandwidth decreases, allowing a finer resolution in frequency. A finer frequency resolution produces output that more accurately reflects the original frequency content of the signal.
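The calculations above can be reproduced in a few lines (an illustrative sketch; the variable names are mine):

```python
fs = 8000                        # sampling rate in Hz
N = 1024                         # Fourier Transform size

bw_bin = (fs / 2) / (N / 2)      # bandwidth per frequency bin: 7.8125 Hz
k = 256
f_max = bw_bin * k               # upper edge of bin 256: 2000 Hz
f_min = f_max - bw_bin           # lower edge of bin 256: 1992.1875 Hz
```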

An optimization of the Fourier Transform is the Fast Fourier Transform (FFT). This is a much more efficient way to compute the Fourier Transform of a given signal. There are many different algorithms for computing the FFT, such as the Cooley-Tukey algorithm. Many algorithms rely on a divide-and-conquer approach, where the overall Fourier Transform is computed by breaking the computation down into smaller Fourier Transforms. Direct implementation of the Fourier Transform is of order N², while the Fast Fourier Transform achieves a much lower order of N·log(N).

Another optimization is pre-computing twiddle factors. The complex exponential in the Fourier Transform definition is known as the twiddle factor. Its value is independent of the input signal x[n] and is always the same for a given n, k, and N. Since these values never change for a particular n and k, a table of values of size n by k can be computed ahead of time. This look-up table holds every possible value of the complex exponential for a given n and k. Rather than computing the exponential every time, the algorithm ‘looks up’ the value in the table. This greatly improves the efficiency of the Fourier Transform algorithm.
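A minimal sketch of this idea in NumPy, where the whole twiddle table is built up front as one N × N matrix (the function name is illustrative; this is still the direct O(N²) DFT, shown only to make the precomputed-table idea concrete):

```python
import numpy as np

def dft_with_twiddle_table(x):
    """Naive DFT that precomputes the twiddle factors e^{-i 2 pi k n / N}
    in an N x N look-up table instead of recomputing them per term."""
    N = len(x)
    n = np.arange(N)
    k = n.reshape(-1, 1)
    twiddle = np.exp(-2j * np.pi * k * n / N)  # table indexed by (k, n)
    return twiddle @ x                         # X[k] = sum_n twiddle[k, n] * x[n]
```

The result agrees with a library FFT; the FFT simply computes the same sums far more efficiently.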


Mel-Filtering

The next stage of speech processing is converting the output of the Fourier Transform to the mel scale rather than linear frequency. The mel scale was first introduced in 1937 by Stevens, Volkmann, and Newman. The mel scale is based on the fact that human hearing responds to changes in frequency logarithmically rather than linearly. The frequency of 1 kHz was used as a reference point, with 1000 Hz equal to 1000 mels. The equation relating frequency to mels is as follows:

m = 2595 · log10(1 + f/700)

The above graph shows the transformation function from linear scale to logarithmic scale frequency. A greater change in linear frequency is required for the same increment in mel scale as linear frequency increases.
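The mel equation above can be written as a small function (a sketch; `hz_to_mel` is an illustrative name):

```python
import math

def hz_to_mel(f_hz):
    """Convert linear frequency in Hz to the mel scale: m = 2595 * log10(1 + f/700)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

# The 1 kHz reference point maps to approximately 1000 mels,
# while 0 Hz maps to 0 mels.
```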

To apply this transformation to the frequency spectrum output of the Fourier Transform stage, a series of triangular filters must be created. Each filter is applied to the linear frequency spectrum to generate the mel-scale frequency spectrum. The number of mel filters depends on the application, but typically 20-40 channels are used. The graph below shows 25 mel filters to be applied to the frequency spectrum obtained in the previous section. It can be observed that for each successive filter, the bandwidth increases, covering a larger range of frequencies. The magnitude of each filter also decreases; this is due to normalization of the magnitude according to the bandwidth that the filter covers.

Applying mel filtering to the frequency spectrum results in a vector that is the same length as the number of mel filters applied. Each mel filter function is multiplied element-wise with the frequency spectrum and the results are summed. This sum of products produces a single value corresponding to the magnitude of signal energy at a particular mel frequency. This process is repeated for each mel filter.

Every filter channel has a magnitude of zero for all values that fall outside of its triangle, thus eliminating all frequency information outside that mel filter channel's range. Frequencies nearest to the center of the mel filter have the most impact on the output value, with linearly decreasing significance approaching either side of the triangle. See the figure below to observe the input linear frequency spectrum and the resulting mel-scale frequency spectrum.
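The filter-and-sum operation described above amounts to a matrix-vector product between a bank of triangular weights and the magnitude spectrum. A hedged NumPy sketch follows; the 25 filters match the example in the text, but the sample rate and FFT size are arbitrary choices, and the bandwidth normalization mentioned above is omitted for brevity:

```python
import numpy as np

def mel_filterbank(n_filters, n_fft, fs):
    """Triangular filters spaced uniformly on the mel scale.

    Returns an (n_filters, n_fft // 2 + 1) weight matrix."""
    hz_to_mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    mel_to_hz = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)

    # n_filters triangles need n_filters + 2 edge points on the mel axis.
    mel_edges = np.linspace(0.0, hz_to_mel(fs / 2.0), n_filters + 2)
    bin_edges = np.floor((n_fft + 1) * mel_to_hz(mel_edges) / fs).astype(int)

    fbank = np.zeros((n_filters, n_fft // 2 + 1))
    for k in range(n_filters):
        lo, center, hi = bin_edges[k], bin_edges[k + 1], bin_edges[k + 2]
        for b in range(lo, center):            # rising edge of the triangle
            fbank[k, b] = (b - lo) / max(center - lo, 1)
        for b in range(center, hi):            # falling edge of the triangle
            fbank[k, b] = (hi - b) / max(hi - center, 1)
    return fbank

# Applying the bank: one multiply-and-sum per filter, i.e. a dot product.
spectrum = np.abs(np.fft.rfft(np.random.randn(512)))
mel_energies = mel_filterbank(25, 512, 16000) @ spectrum
print(mel_energies.shape)   # (25,) -- one value per mel channel
```

The output vector has exactly one entry per filter, matching the statement above that mel filtering yields a vector the same length as the number of filters.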

The blue graph shows the frequency spectrum obtained from the previous section, the Fourier Transform. The graph below in red shows the output after converting to mel-frequency scale. It can be observed that the mel-frequency spectrum has the same overall shape as the linear scaled frequency spectrum, but the higher frequency information has been compressed together.

Mel-Frequency Cepstral Coefficients

The final two steps of speech processing produce results called mel-frequency cepstral coefficients, or MFCCs. These coefficients form the feature vector used to represent the frame of speech being analyzed. As mentioned before, the feature vector needs to accurately characterize the input. The two mathematical operations applied after the previous steps are taking the logarithm and applying the discrete cosine transform.

The above figure shows the associated feature vector when using the mel-frequency spectrum obtained in the previous section. The first step is to take the logarithm (base 10) of each value in the mel-frequency spectrum. This is a very useful operation, as it allows the separation of signals combined through convolution. For example, if a speech signal is convolved with a noise signal such as background noise:

y(t) = x(t) ∗ n(t)

where y(t) is the combined speech-and-noise signal, x(t) is the original speech, and n(t) is the noise signal.

By taking the Fourier Transform of both sides of the equation, the convolution operation becomes a multiplication. This is due to the convolution property of the Fourier Transform.

y(t) = x(t) ∗ n(t)

Y(ω) = X(ω) · N(ω)

Then, by applying the logarithm property of multiplication, the original signal and the noise signal are mathematically added together instead of multiplied. This allows the subtraction of an undesired signal that has been convolved with a desired signal.

Y(ω) = X(ω) · N(ω)

log10(Y(ω)) = log10(X(ω)) + log10(N(ω))

From the last equation, if the noise signal is known, it can be subtracted from the combined signal. After the logarithm has been taken of each value of the mel-frequency spectrum, the final stage of speech processing is to apply the discrete cosine transform.
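The convolution-to-addition chain can be checked numerically. For circular convolution the Fourier relation is exact, so a small NumPy sketch with arbitrary random stand-in signals (purely illustrative) verifies it:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(64)            # stand-in for the speech signal
n = rng.standard_normal(64)            # stand-in for the noise signal

# Circular convolution y = x (*) n, computed via the DFT.
y = np.fft.ifft(np.fft.fft(x) * np.fft.fft(n)).real

# Convolution property: Y = X . N, so log|Y| = log|X| + log|N|.
log_Y = np.log10(np.abs(np.fft.fft(y)))
log_X = np.log10(np.abs(np.fft.fft(x)))
log_N = np.log10(np.abs(np.fft.fft(n)))

print(np.allclose(log_Y, log_X + log_N))   # True
```

In the log-spectral domain the two convolved signals appear as a sum, which is what makes subtracting a known noise term possible.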

Fourier Transform:

X[k] = Σ_{n=0..N−1} x[n] · e^(−i2πkn/N)

e^(−i2πkn/N) = cos(−2πkn/N) + i·sin(−2πkn/N)

Then drop the imaginary component and the kernel becomes:

cos(−2πkn/N)

The resulting discrete cosine transform equation is:

X[k] = Σ_{n=0..N−1} x[n] · cos(−2πkn/N)

This operation results in a vector of values transformed from the mel-frequency domain to the cepstral domain; this transformation led to the name mel-frequency cepstral coefficients, or MFCCs. Most applications use only the first 13 values to form the feature vector, truncating the remaining results. The length of the feature vector depends on the application, but 13 values are sufficient for most speech recognition tasks. The discrete cosine transform used here is nearly identical to the Fourier transform, except that it drops the imaginary part of the kernel.
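The cosine-kernel sum above can be written out directly. The NumPy sketch below uses the document's unnormalized kernel (not the orthonormal DCT-II that many libraries default to) and a random 25-channel stand-in for the log mel energies:

```python
import numpy as np

def dct_coefficients(log_mel, n_coeff=13):
    """Unnormalized DCT as in the text: X[k] = sum_n x[n] * cos(2*pi*k*n/N),
    keeping only the first n_coeff values as the feature vector."""
    N = len(log_mel)
    n = np.arange(N)
    return np.array([np.sum(log_mel * np.cos(2 * np.pi * k * n / N))
                     for k in range(n_coeff)])

# Stand-in log mel energies (25 channels, arbitrary values).
log_mel = np.log10(np.abs(np.random.randn(25)) + 1.0)
features = dct_coefficients(log_mel)
print(features.shape)   # (13,)
```

Note that the k = 0 coefficient is just the sum of the log energies (cos 0 = 1 for every term), which is why it tracks the overall signal energy and is often treated separately.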

Experiments and Results

For this project, a speech processing tool was created with MATLAB to allow analysis of all the steps involved. The tool has been named SASE Lab, for Speech Analysis and Sound Effects Lab. The result is a graphical user interface that allows the user to either record a signal or open an existing waveform for analysis. The signal can then be played back to hear what was said. The interface has six plots that show the speech signal at the various stages of speech processing. Starting at the top-left, a plot shows the entire waveform of the original signal. After opening or recording a signal, the user can click on a part of the displayed signal to analyze that particular frame of speech. After a frame has been selected, the other five plots show information related to that frame. The top-right plot shows only the selected frame of speech, rather than the whole signal, with the windowing function overlaid in red. On the middle row, left side, the plot shows the frame of speech after the windowing function has been applied. To its right, the frequency spectrum of the speech frame is shown. On the bottom row, left side, the plot shows the result of converting the linear-scale frequency spectrum to the mel scale. The plot on the bottom-right shows the final output of the speech processing: a vector of 13 features representing the input for the particular frame being analyzed.

The above picture is a screenshot of the SASE Lab tool analyzing a particular frame of speech from a signal. The part of the original signal being analyzed is highlighted with a vertical red bar in the top-left plot.

In addition to these six plots, there are three buttons above them. These three buttons control starting a recording from a microphone, stopping a recording, and playing a recording. Along the menu bar, the user has standard options such as File, View, and Effects. The File menu allows a user to open or save a recording. The View menu allows the user to switch between the view shown above or the spectrogram view. The Effects menu allows the user to apply several different audio effects to the speech signal.

The other main view of SASE Lab shows the spectrogram of the signal, which displays how the frequency content of the signal changes over time. This is an important piece of information when dealing with speech and speech recognition: the human ear differentiates sounds based on frequency content, so it is important to analyze speech signals for their frequency content.

The screenshot above shows the spectrogram view of SASE Lab. The top half of the display shows the waveform of the original signal; the bottom section shows the spectrogram, with time on the x-axis and frequency on the y-axis. In this screenshot the speaker is saying “a, e, i, o, u”, and each utterance shows a distinct frequency pattern.

The spectrogram view of SASE Lab also allows the user to select a 3-second window to analyze separately when a long speech signal is present. When looking at a longer signal, the spectrogram becomes crowded because a large amount of data is compressed into the same display space. See the following screenshot for an example.

The speech signal shown is approximately 18 seconds long and hard to analyze when showing the spectrogram of the whole signal at once. To alleviate this issue, a slider bar was implemented to allow the user to select a 3 second window of the entire speech signal. The window is shown in the top graph by the highlighted red section of the signal waveform. In a new window, the 3 second section of speech waveform is plotted, along with the corresponding section of spectrogram. See the screenshot below.

This figure shows only the section of speech highlighted in the previous figure. Spectral characteristics are much easier to observe and interpret compared to viewing the entire speech signal spectrogram at the same time. The user can change the position on the slider bar from the main view, and the secondary view will update its content to show the new 3 second selection. The user may also play only the 3 second selection by pressing the Play Section button on the right side of the main view. The Play All button will play the entire signal.

Analysis of the feature vectors produced for different vowel sounds yields the expected results. The feature vector should accurately characterize the original sound from the frame of speech, and for different speech sounds it needs to show distinct characteristics to allow analysis and recognition. The following five figures show how SASE Lab analyzes the vowel sounds “a, e, i, o, u”. These sounds are produced by voiced speech, in which the vocal tract is driven by periodic pulses of air from the lungs. The periodic nature of the driving signal is characterized by the speaker's pitch. For example, males tend to have lower pitch than females and thus a greater amount of time between each pulse of air. The pitch of the speaker can be observed on the frequency spectrum in SASE Lab.

The above figure shows the speech waveform recorded for a speaker saying the sounds “a, e, i, o, u” (/ey/, /iy/, /ay/, /ow/, /uw/). In the first plot, which shows the entire speech signal, there are five distinct regions of significant amplitude indicating sound, separated by low-amplitude regions corresponding to slight pauses between utterances. In the first graph, a vertical red line is placed on the first region of sound, the “a” or /ey/ sound; its placement controls which frame of the signal is analyzed. The next plot shows a periodic signal with approximately three complete periods: the frame selected for analysis. The following three plots show the signal as it passes through each stage of speech processing. The final plot, on the bottom right, shows the mel-frequency cepstral coefficients (MFCCs) for the frame being analyzed. These 13 values are called the feature vector. Note that the first value of the feature vector is omitted from the plot. This value contains the energy of the signal and is typically of greater magnitude than the other 12 values; showing it would re-scale the graph, and the detail of the other 12 features would be lost.

The four figures below show how the feature vector differs for each sound produced. The difficulty lies in the fact that even for the same speaker, every time a particular sound is produced, there will be slight variations. This is one factor that makes speech recognition a complicated task. Current solutions require training a model for each sound. The training process entails collecting many feature vectors for a sound and creating a statistical model of the distribution of the features for the given sound. Then, when comparing an unknown feature vector to the likely distributions of features for a given sound, a probability that the unknown feature vector belongs to a known sound can be computed.

Feature Vector for “a” /ey/

Feature Vector for “e” /iy/

Feature Vector for “i” /ay/

Feature Vector for “o” /ow/

Summary

Speech processing entails many different aspects of mathematics and signal processing techniques. The main goal of this process is a reduction in the amount of data while maintaining an accurate representation of the speech signal's characteristics. For every frame of speech, typically 10-30 milliseconds, a feature vector must be computed that contains these characteristics. An average frame is approximately 256 samples of audio data, while the feature vector is typically only 13 values. This reduction in the amount of data allows for more efficient processing: for example, if the feature vector is passed along to a speech recognition process, analysis of the feature vector is computationally more efficient than analysis of the original frame of speech.

A speech processing system contains several stages. The first stage separates the speech signal into frames of 10-30 millisecond duration. Speech is approximately stationary over this period, which allows efficient analysis of a semi-stationary signal. Each frame of data is then passed through the remaining stages to produce the end result, a feature vector. The next step applies a pre-emphasis filter to compensate for the lower energy in the higher frequencies of human speech production. After this filter, the Fourier Transform of the signal is taken to compute the frequency spectrum of the speech frame, which indicates the frequency content of the speech signal. The frequency spectrum is then converted to a logarithmic scale through the process of mel filtering, modeling the way humans perceive frequencies: logarithmically. After this step, the base-10 logarithm is taken of the resulting values, and finally the discrete cosine transform is applied. The resulting vector is truncated to 13 values and forms the feature vector. This feature vector characterizes the speech sounds present in the original frame, but with the advantage of using far less data. It can then be passed along to a speech recognition system for further analysis.
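The stages outlined above can be chained in a few lines. This is a hedged NumPy sketch of the pipeline, not the project's MATLAB implementation: the 0.97 pre-emphasis coefficient, the Hamming window, and the simplified mel stage (uniform mel-spaced band averages instead of overlapping triangular filters) are illustrative assumptions:

```python
import numpy as np

def mfcc_frame(frame, fs=8000, n_mels=25, n_coeff=13):
    """Feature vector for one speech frame: pre-emphasis -> window ->
    FFT -> mel filtering (simplified) -> log10 -> DCT -> truncate."""
    # Pre-emphasis: boost high frequencies (0.97 is a common choice).
    emphasized = np.append(frame[0], frame[1:] - 0.97 * frame[:-1])
    windowed = emphasized * np.hamming(len(frame))
    spectrum = np.abs(np.fft.rfft(windowed))

    # Simplified mel stage: average the spectrum over n_mels bands spaced
    # uniformly on the mel scale (real triangular filters overlap).
    mel = lambda f: 2595.0 * np.log10(1.0 + f / 700.0)
    imel = lambda m: 700.0 * (10.0 ** (m / 2595.0) - 1.0)
    edges = imel(np.linspace(0.0, mel(fs / 2.0), n_mels + 1))
    freqs = np.linspace(0.0, fs / 2.0, len(spectrum))
    energies = np.array([spectrum[(freqs >= lo) & (freqs < hi)].mean()
                         for lo, hi in zip(edges[:-1], edges[1:])])

    log_energies = np.log10(energies + 1e-10)
    # Unnormalized cosine transform, truncated to the first n_coeff values.
    n = np.arange(n_mels)
    return np.array([np.sum(log_energies * np.cos(2 * np.pi * k * n / n_mels))
                     for k in range(n_coeff)])

frame = np.sin(2 * np.pi * 440 * np.arange(256) / 8000)  # 256-sample test frame
features = mfcc_frame(frame)
print(features.shape)   # (13,)
```

The data reduction discussed above is visible directly: a 256-sample frame in, a 13-value feature vector out.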

The MATLAB tool created for this project, SASE Lab, performs all stages of speech processing to produce a feature vector for each frame of speech signal data. SASE Lab also graphs the data after each stage of the speech processing task; this breakdown allows the user to visualize how the data is manipulated through each step of the process. In addition to speech processing, the tool incorporates several digital signal processing techniques to add audio effects to a speech signal. The user can then analyze how these effects alter the feature vectors produced, or the data at any stage of speech processing. Effects include echo, reverberation, flange, chorus, vibrato, tremolo, and modulation.

References

Këpuska, Dr. Veton. Discrete-Time Signal Processing Framework. http://my.fit.edu/~vkepuska/ece5525/Ch2-Discrete-Time%20Signal%20Processing%20Framework2.ppt

Këpuska, Dr. Veton. Acoustics of Speech Production. http://my.fit.edu/~vkepuska/ece5525/Ch4-Acoustics_of_Speech_Production.pptx

Këpuska, Dr. Veton. Speech Signal Representations. http://my.fit.edu/~vkepuska/ece5526/Ch3-Speech_Signal_Representations.pptx

Këpuska, Dr. Veton. Automatic Speech Recognition. http://my.fit.edu/~vkepuska/ece5526/Ch5-Automatic%20Speech%20Recognition.pptx

Oppenheim, Alan V., and Ronald W. Schafer. Discrete-Time Signal Processing. 3rd ed. Upper Saddle River, NJ: Pearson, 2010.

Phillips, Charles L., and John M. Parr. Signals, Systems, and Transforms. 4th ed. Upper Saddle River, NJ: Pearson/Prentice Hall, 2008.

Quatieri, T. F. Discrete-Time Speech Signal Processing: Principles and Practice. Upper Saddle River, NJ: Prentice Hall, 2002.

D. Example of Homework for Speech Recognition Class

Homework Assignment #2

Gaussian Statistics & Statistical Pattern Recognition
(Problems & MATLAB scripts adapted from EPFL lab notes)

(100 Points)

Prerequisite

Attached is the homework package HW#2.zip, or you can get it from http://my.fit.edu/~vkepuska/ece5526/HW/HW#2.zip. Unpack the package. From the MATLAB workspace, change directory to the newly created HW2 directory containing the data and MATLAB scripts.

Gaussian Statistics

Problem 1 (20 Points)

Generate a sample X = {x1, x2, x3, …, xN} of N = 10000 points drawn from a Gaussian process that has mean:

μ = (730, 1090)ᵀ

and covariance:

A. Σ1 = [8000 0; 0 8000]

B. Σ2 = [8000 0; 0 18500]

C. Σ3 = [8000 8400; 8400 18500]

Use the MATLAB function gausview to plot each Gaussian distribution as a scatter plot in a 2-dimensional plane view. The tool displays the data of a 2-dimensional function as both 2-D and 3-D plots.

Example:

>> N = 10000;
>> mu = [730 1090];
>> sigma_1 = [8000 0; 0 8000];
>> X1 = randn(N,2)*sqrtm(sigma_1) + repmat(mu,N,1);
>> gausview(X1, mu, sigma_1, 'Sample X1');

By observing the 2-D views of the data and the corresponding pdf contours, how could one infer whether the sample process has a diagonal covariance with equal variances, a diagonal covariance with different variances, or a full covariance matrix with different (or equal) variances and covariances?

Sample Mean & Variance of Gaussian Model

Problem 2 (20 Points)

Using the last data set generated in Problem 1, X3, compute an estimate of the mean μ̂ and covariance Σ̂ of the pdf of the data (assuming that it is Gaussian) as follows:

A. Use only 100 of the available points to compute μ̂100 and Σ̂100.

B. Use 1000 of the available points to compute μ̂1000 and Σ̂1000.

C. Use all 10000 available points to compute μ̂10000 and Σ̂10000.

Compare the estimated mean μ̂ with the original value μ used in Problem 1 using the Euclidean distance. Similarly, compare the estimated covariance Σ̂ with the original value Σ3 using the matrix 2-norm of their difference, ‖A − B‖2, which gives a similarity measure for two matrices. See the MATLAB norm function for this purpose.

Example:

>> X = X3(1:1000,:); % could also use the rand function to pick points
>> N = size(X,1);
>> mu_1000 = sum(X)/N; % or mu_1000 = mean(X);
>> sigma_1000 = (X - repmat(mu_1000, N, 1))' * (X - repmat(mu_1000, N, 1))/(N-1);
>> % or sigma_1000 = cov(X);
>> % Comparison of values: Euclidean distance
>> e_mu = sqrt((mu_1000 - mu) * (mu_1000 - mu)');
>> disp(sprintf('Difference of the means: %f', e_mu));
>> % 2-norm of the covariance-matrix difference
>> e_sigma = norm(sigma_1000 - sigma_3);
>> disp(sprintf('Difference of the covariance matrices: %f', e_sigma));

When comparing the estimated values of μ̂ and Σ̂ to the original values, what do you observe?

Sample Likelihood with respect to a Gaussian Model

Problem 3 (20 Points)

Likelihood: The likelihood of a sample point with respect to a Gaussian model, Θ = (μ, Σ), is the probability density function of that point for the given model.

Joint Likelihood: The joint likelihood of a set of independent and identically distributed (i.i.d.) points X = {x1, x2, x3, …, xN} is the product of the likelihoods of the individual points, e.g.:

p(X, Θ) = ∏_{i=1..N} p(xi, Θ) = ∏_{i=1..N} p(xi | μ, Σ) = ∏_{i=1..N} g(μ,Σ)(xi)

Given the 4 Gaussian models:

N1: Θ1 = ([730, 1090]ᵀ, [8000 0; 0 8000])
N2: Θ2 = ([730, 1090]ᵀ, [8000 0; 0 18500])
N3: Θ3 = ([730, 1090]ᵀ, [8000 8400; 8400 18500])
N4: Θ4 = ([270, 1690]ᵀ, [8000 8400; 8400 18500])

Compute the following log-likelihoods for the whole sample X3 (10000 points):

log p(X3, Θ1), log p(X3, Θ2), log p(X3, Θ3), log p(X3, Θ4)

What is the advantage of computing log-likelihood vs. just the likelihood?

Example:

>> N = size(X3,1);
>> mu_1 = [730 1090]; sigma_1 = [8000 0; 0 8000];
>> logLike1 = 0;
>> for i = 1:N
>>     logLike1 = logLike1 + (X3(i,:) - mu_1)*inv(sigma_1)*(X3(i,:) - mu_1)';
>> end
>> logLike1 = -0.5 * (logLike1 + N*log(det(sigma_1)) + 2*N*log(2*pi));

Use the function gausview to compare the relative positions of the models N1, N2, N3, and N4 with respect to the data set X3; this will help explain why the different models yield different log-likelihoods.

Of N1, N2, N3, and N4, which model best “explains” the data? Which model has the highest number of parameters? Which model would you choose as a good compromise between the number of parameters and the capacity to represent the data accurately?

Statistical Pattern Recognition

Problem 4 (20 Points)

Load the data from the file “vowels.mat”. This file contains a database of simulated 2-dimensional speech features in the form of artificial pairs of formant values (the first and second spectral formants, [F1, F2]). These artificial values represent the features that would be extracted from several occurrences of the vowels /a/, /e/, /i/, /o/ and /y/. They are grouped in matrices of size N×2, where each of the N rows is a training example and 2 is the dimension of the features (in our case, formant frequency pairs).

Supposing that the whole database adequately covers an imaginary language made up of /a/'s, /e/'s, /i/'s, /o/'s and /y/'s, compute the probability P(qk) of each class qk, k ∈ {/a/, /e/, /i/, /o/, /y/}. From the provided data, rank the phonemes by their frequency of occurrence. What are the most common and the least common phonemes in this language?

Example:

>> clear all; close all;
>> load vowels.mat; whos
>> Na = size(a,1); Ne = size(e,1); Ni = size(i,1); No = size(o,1); Ny = size(y,1);
>> N = Na + Ne + Ni + No + Ny;
>> Pa = Na/N;
>> Pe = Ne/N;
etc.

Gaussian Modeling of Classes

Problem 5 (20 Points)

Plot each vowel's data as a cloud of points in the 2-D plane. Train the Gaussian model corresponding to each class (use the MATLAB commands mean and cov). Plot their contours (use the function plotgaus(mu, sigma, color), where color = [R, G, B]).

Example:

>> plotvow; % plot the clouds of simulated vowel features; the figure object will be used to display Gaussian distributions
>> mu_a = mean(a); sigma_a = cov(a);
>> plotgaus(mu_a, sigma_a, [0 1 1]);
>> mu_e = mean(e); sigma_e = cov(e);
>> plotgaus(mu_e, sigma_e, [1 0 1]);
etc.

Note your results in the following table:

μ/a/ =        Σ/a/ =
μ/e/ =        Σ/e/ =
μ/i/ =        Σ/i/ =
μ/o/ =        Σ/o/ =
μ/y/ =        Σ/y/ =

Bayesian Classification

Problem 6 (20 Points)

Useful formulas and definitions:

Bayes’ Decision Rule: Given a set of classes qk characterized by a set of known parameters Θ, a set of one or more speech feature vectors X (also called observations) belongs to the class with the highest probability P(qk|X, Θ). This probability is called the a posteriori probability, because it depends on having seen the observations, as opposed to the a priori probability P(qk|Θ).

X ∈ qk if P(qk|X, Θ) ≥ P(qj|X, Θ), ∀ j ≠ k

Bayes’ Law: Use the likelihoods (rather than estimating the posterior probability directly):

P(qk|X, Θ) = p(X|qk, Θ) · P(qk|Θ) / p(X|Θ)

In the case of speech features, the observations X are considered equi-probable, hence:

P(qk|X, Θ) ∝ p(X|qk, Θ) · P(qk|Θ), ∀ k

For convenience, the likelihood computation is done in the log domain:

log P(qk|X, Θ) ∝ log p(X|qk, Θ) + log P(qk|Θ), ∀ k

A. In the case of Gaussian models for phoneme classes, what is the meaning of the parameter vector Θ given above?

B. What is the expression of p(X|qk, Θ) and log p(X|qk, Θ)?

C. What is the definition of the probability P(qk|Θ)?

D. Compute the Gaussian pdfs (means and covariances) for each vowel class. Given that P(qk) was computed for each class of this fictitious language, and assuming that speech features are equi-probable, what is the most probable class qk for each of the speech feature points x = (F1, F2)ᵀ in the following table?

x    F1   F2    log P(q/a/|x)   log P(q/e/|x)   log P(q/i/|x)   log P(q/o/|x)   log P(q/y/|x)   Most Probable Class
1.   400  1800
2.   400  1000
3.   530  1000
4.   600  1300
5.   670  1300
6.   420  2500

Example:

Use the function gloglike(point, mu, sigma) to compute the likelihoods. Do not forget to add the log of the prior probability!

>> gloglike([400, 1800], mu_a, sigma_a) + log(Pa);

Discriminant Surfaces

Problem 7 (20 Points)

A set of functions fk(x) are called discriminant functions when they are used to classify a sample x into one of K possible classes qk.

x ∈ qk if fk(x, Θk) ≥ fl(x, Θl), ∀ l ≠ k

A. What is the relationship between Bayesian classifiers and discriminant functions?

B. The iso-likelihood lines for the Gaussian pdfs N(μ/i/, Σ/i/) and N(μ/e/, Σ/e/) are presented in the figures below; in the first figure each class has a different covariance matrix, while in the second the same covariance matrix is used for both (i.e., N(μ/i/, Σ/e/) and N(μ/e/, Σ/e/)). Use a colored pen to join the intersections of the iso-level lines that correspond to equal likelihoods.

C. What is the form of the surface that separates class /i/ from class /e/ when the two models have different variances? Can you explain the origin of this form? What is the surface that separates class /i/ from class /e/ when the two models have the same variances? Why does it differ from the previous discriminant surface? Use a mathematical derivation of the discriminant functions to prove your claims.

Unsupervised Training

Problem 8 (20 Points)

In the previous problems, the models for the classes /a/, /e/, /i/, /o/, and /y/ were computed knowing a priori which training samples belong to which class. This approach depicts a supervised training procedure for Gaussian models. Suppose instead that the available data is not labeled with the corresponding class, and that it is desired to separate the data into several classes without knowing a priori which point belongs to which class. The solution to this problem is called unsupervised training. Several algorithms are available to perform unsupervised training, among them the K-means, Viterbi-EM, and EM (Expectation-Maximization) algorithms. All algorithms of this type are characterized by the following:

- A set of models qk (not necessarily Gaussian), defined by some parameters (means, variances, priors, …);
- A measure of membership, telling to what extent a data point “belongs” to a model;
- A “recipe” to update the model parameters as a function of the membership information.

The measure of membership usually takes the form of a distance or of a probability. It replaces the missing labeling information to permit the application of standard parameter estimation techniques. It also implicitly defines a global criterion of “goodness of fit” of the models to the data, e.g.:

- In the case of a distance, the models that are globally closer to the data characterize it better;
- In the case of a probability measure, the models giving a higher likelihood for the data explain it better.

Table 1 summarizes the components of each of the algorithms that will be studied in the following experiments. More detail is given in the corresponding subsections.

Table 1. Characteristics of some usual unsupervised clustering algorithms

K-means
  Parameters: mean μk
  Membership measure: Euclidean distance
    dk(xn) = √((xn − μk)ᵀ(xn − μk))
  Update method: find the points closest to qk(old), then:
    μk(new) = mean of the points assigned to qk(old)
  Global criterion: least squares

Viterbi-EM
  Parameters: mean μk, variance Σk, priors P(qk|Θ)
  Membership measure: posterior probability
    dk(xn) = P(qk|xn, Θ) = (1 / ((2π)^(d/2) |Σk|^(1/2))) · exp(−(1/2)(xn − μk)ᵀ Σk⁻¹ (xn − μk)) · P(qk|Θ)
  Update method: do a Bayesian classification of each data point, then:
    μk(new) = mean of the points classified into qk(old)
    Σk(new) = Var(qk(old))
    P(qk(old)|Θ(new)) = (number of points in qk(old)) / (total number of training points)
  Global criterion: maximum likelihood

EM
  Parameters: mean μk, variance Σk, priors P(qk|Θ)
  Membership measure: posterior probability
    dk(xn) = P(qk|xn, Θ), computed as above
  Update method: compute P(qk(old)|xn, Θ(old)), then:
    μk(new) = [ Σ_{n=1..N} xn · P(qk(old)|xn, Θ(old)) ] / [ Σ_{n=1..N} P(qk(old)|xn, Θ(old)) ]
    Σk(new) = [ Σ_{n=1..N} P(qk(old)|xn, Θ(old)) · (xn − μk(new))(xn − μk(new))ᵀ ] / [ Σ_{n=1..N} P(qk(old)|xn, Θ(old)) ]
    P(qk(new)|Θ(new)) = (1/N) Σ_{n=1..N} P(qk(old)|xn, Θ(old))
  Global criterion: maximum likelihood

A. K-Means Algorithm

Do:

[1] For each data point xn, n = 1, …, N, compute the squared Euclidean distance from the kth prototype:

    dk(xn) = ‖xn − μk‖² = (xn − μk)(xn − μk)ᵀ

[2] Assign each data point xn to its closest prototype, i.e. assign xn to the class qk if:

    dk(xn) ≤ dl(xn), ∀ l ≠ k

    Note: Using the square of the Euclidean distance for the classification gives the same result as using the true Euclidean distance, since the square root is a monotonically growing function, but the computational load is obviously lighter when the square root is dropped.

[3] Replace each prototype with the mean of the data-points assigned to the corresponding class;

[4] Go to [1].

Until: no further change occurs.

The global criterion that is being minimized in the presented algorithm is the total squared distance between the data and the corresponding models:

J = Σ_{k=1..K} Σ_{xn ∈ qk} dk(xn)
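For readers who prefer code to pseudocode, the loop [1]–[4] can be sketched in a few lines. This is an illustrative NumPy translation, not the KMEANS tool shipped with the experiment; the function name and interface are invented for this sketch:

```python
import numpy as np

def kmeans(data, means, n_iter=100):
    """Plain K-means: data is (N, d), means is (K, d) initial prototypes."""
    means = np.asarray(means, dtype=float).copy()
    for _ in range(n_iter):
        # [1] squared Euclidean distance from every point to every prototype
        d2 = ((data[:, None, :] - means[None, :, :]) ** 2).sum(axis=2)
        # [2] assign each point to its closest prototype
        labels = d2.argmin(axis=1)
        # [3] replace each prototype with the mean of its assigned points
        new_means = np.array([data[labels == k].mean(axis=0)
                              for k in range(len(means))])
        if np.allclose(new_means, means):   # until no further change
            break
        means = new_means
    return means, labels
```

Note the sketch assumes no cluster ever becomes empty; the provided tool handles initialization more carefully.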

KMEANS K-means algorithm exploration tool provided with this experiment:

% KMEANS K-means algorithm exploration tool
%
% Launch it with KMEANS(DATA,NCLUST) where DATA is the matrix
% of observations (one observation per row) and NCLUST is the
% desired number of clusters.
%
% The clusters are initialized with a heuristic that spreads
% them randomly around mean(DATA) with standard deviation
% sqrtm(cov(DATA)).
%
% If you want to set your own initial clusters, use
% KMEANS(DATA,MEANS) where MEANS is a cell array containing
% NCLUST initial mean vectors.
%
% Example: for two clusters
% means{1} = [1 2]; means{2} = [3 4];
% kmeans(data,means);
%

Launch the tool with the data sample allvow, which is part of vowels.mat and contains the complete simulated vowel data. Use the MATLAB load command to load the data from vowels.mat: load('vowels.mat'). Perform several runs of the kmeans algorithm with different initializations of the algorithm:

1. 5 initial clusters determined according to the default heuristic;
2. Some initial MEANS values equal to some data points;
3. Some initial MEANS values equal to the means of {/a/, /e/, /i/, /o/, /y/}.


In each case, iterate until the algorithm converges. Observe the evolution of the cluster centers, of the data-point attribution chart and of the total squared Euclidean distance. Note that it is possible to zoom in the generated plots: left-click inside the axes to zoom 2x centered on the point under the mouse; right-click to zoom out; click and drag to zoom into an area; double-click to reset the figure to the original. Observe the mean values found after the convergence of the algorithm.

Example:
>> kmeans(allvow,5);
>> for k=1:5, disp(kmeans_result_means{k}); end

1. Does the final solution depend on the initial state of the algorithm?
2. Describe the evolution of the total squared Euclidean distance.
3. What is the nature of the discriminant surfaces corresponding to a minimum Euclidean distance classification scheme?
4. Is the algorithm suitable for fitting Gaussian clusters?


B. Viterbi-EM Algorithm for Gaussian Clustering

Start from K initial Gaussian models N(μk, Σk), k = 1, …, K, characterized by the set of parameters Θ (i.e., the set of all means and variances (μk, Σk), k = 1, …, K). Set the initial prior probabilities P(qk) to 1/K (equally likely classes).

Do:

1. Classify each data point using Bayes' rule.
   This step is equivalent to having a set Q of Boolean hidden variables that give a labeling of the data by taking the value 1 (belongs) or 0 (does not belong) for each class qk and each point xn. The value of Q that maximizes p(X, Q|Θ) precisely tells which is the most probable model for each point of the whole set X of training data. Hence, each data point is assigned to its most probable cluster qk(new).

2. Update the parameters:
   o Update the means
     μk(new) = (1/Nk) Σ_{i ∈ qk(old)} xi,
     where Nk is the number of points assigned to qk(old)
   o Update the variances
     Σk(new) = (1/Nk) Σ_{i ∈ qk(old)} (xi − μk(new))(xi − μk(new))ᵀ
   o Update the priors
     P(qk(new)|Θ(new)) = (number of training points belonging to qk(old)) / (total number of training points)

3. Go to 1.

Until no further change
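One pass of steps 1–2 can be sketched in NumPy as follows. This is an illustrative translation under the hard-assignment scheme described above, not the VITERB tool provided with the experiment; the function names are invented for this sketch:

```python
import numpy as np

def log_gauss(x, mu, sigma):
    """Log of the multivariate normal density N(mu, sigma) at each row of x."""
    d = x.shape[1]
    diff = x - mu
    inv = np.linalg.inv(sigma)
    quad = np.einsum('ni,ij,nj->n', diff, inv, diff)
    return -0.5 * (quad + np.log(np.linalg.det(sigma)) + d * np.log(2 * np.pi))

def viterbi_em_step(data, means, covs, priors):
    """One Viterbi-EM iteration: Bayesian hard classification, then update."""
    K = len(means)
    # 1. classify each point with Bayes' rule (argmax of the log posterior)
    scores = np.stack([log_gauss(data, means[k], covs[k]) + np.log(priors[k])
                       for k in range(K)], axis=1)
    labels = scores.argmax(axis=1)
    # 2. update means, variances and priors from the hard assignment
    for k in range(K):
        xk = data[labels == k]
        means[k] = xk.mean(axis=0)
        covs[k] = np.cov(xk, rowvar=False)
        priors[k] = len(xk) / len(data)
    return means, covs, priors, labels
```

Calling viterbi_em_step repeatedly until the labels stop changing mirrors the "until no further change" loop; the sketch assumes no class ever loses all its points.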

A global criterion is defined as follows:

ℑ(Θ) = log p(X, Q|Θ) = Σ_{k=1..K} Σ_{xn ∈ qk} log p(xn|qk, Θ)

It represents the joint likelihood of the data with respect to the models they belong to. This criterion is locally optimized by the algorithm.

Viterbi-EM explorer utility:

% VITERB Viterbi version of the EM algorithm
%
% Launch it with VITERB(DATA,NCLUST) where DATA is the matrix
% of observations (one observation per row) and NCLUST is the
% desired number of clusters.


%
% The clusters are initialized with a heuristic that spreads
% them randomly around mean(DATA) with standard deviation
% sqrtm(cov(DATA)). Their initial covariance is set to cov(DATA).
%
% If you want to set your own initial clusters, use
% VITERB(DATA,MEANS,VARS) where MEANS and VARS are cell arrays
% containing respectively NCLUST initial mean vectors and NCLUST
% initial covariance matrices. In this case, the initial a-priori
% probabilities are set equal to 1/NCLUST.
%
% To set your own initial priors, use VITERB(DATA,MEANS,VARS,PRIORS)
% where PRIORS is a vector containing NCLUST a priori probabilities.
%
% Example: for two clusters
% means{1} = [1 2]; means{2} = [3 4];
% vars{1} = [2 0;0 2]; vars{2} = [1 0;0 1];
% viterb(data,means,vars);
%

Apply the Viterbi-EM algorithm to the allvow dataset. Perform several runs with different initializations of the algorithm:

1. 5 initial clusters determined according to the default heuristic;
2. Some initial MEANS values equal to some data points, and some random VARS values (try for instance cov(allvow) for all the classes);
3. The initial MEANS, VARS and PRIORS values found by the K-means algorithm;
4. Some initial MEANS values equal to the means of {/a/, /e/, /i/, /o/, /y/}, some VARS values equal to the variances of {/a/, /e/, /i/, /o/, /y/}, and PRIORS values equal to [P/a/, P/e/, P/i/, P/o/, P/y/];
5. Some initial MEANS and VARS values at your discretion.

Iterate the algorithm until it converges. Observe the evolution of the clusters, of the data-point attribution chart and of the total likelihood curve. Observe the mean, variance and prior values found after the convergence of the algorithm. Compare them with the values computed with the supervised training algorithm.

Example:
>> viterb(allvow,5);
>> % push the "iterate until convergence" button
VITERB: resulting means, variances and priors are now stored in the workspace variables viterb_result_means, viterb_result_vars and viterb_result_priors.
>> % To see the resulting means, variances and priors:
>> for k=1:5, disp(viterb_result_means{k}); end
>> for k=1:5, disp(viterb_result_vars{k}); end
>> for k=1:5, disp(viterb_result_priors(k)); end


1. Does the final solution depend on the initialization of the algorithm?
2. Describe the evolution of the total likelihood. Is it monotonic?
3. In terms of optimization of the likelihood, what does the final solution correspond to?
4. What is the nature of the discriminant surfaces corresponding to the Gaussian classification?
5. Is the algorithm suitable for fitting Gaussian clusters?


C. EM Algorithm for Gaussian Clustering

Start from K initial Gaussian models N(μk, Σk), k = 1, …, K, with equal priors set to P(qk) = 1/K.

Do:

1. Estimation step: Compute the probability P(qk(old)|xn, Θ(old)) that each data point xn belongs to the class qk(old):

   P(qk(old)|xn, Θ(old)) = P(qk(old)|Θ(old)) p(xn|qk(old), Θ(old)) / p(xn|Θ(old))
                         = P(qk(old)|Θ(old)) p(xn|μk(old), Σk(old)) / [Σ_j P(qj(old)|Θ(old)) p(xn|μj(old), Σj(old))]

   This step is equivalent to having a set Q of continuous hidden variables, taking values in the interval [0, 1], that give a labeling of the data by telling to which extent a point xn belongs to the class qk. This represents a soft classification, since a point can belong, e.g., by 60% to class 1 and by 40% to class 2 (think of Schrödinger's cat, which is 60% alive and 40% dead as long as nobody opens the box or performs a Bayesian classification).

2. Maximization step:
   o Update the means
     μk(new) = [Σ_{n=1..N} xn P(qk(old)|xn, Θ(old))] / [Σ_{n=1..N} P(qk(old)|xn, Θ(old))]
   o Update the variances
     Σk(new) = [Σ_{n=1..N} P(qk(old)|xn, Θ(old)) (xn − μk(new))(xn − μk(new))ᵀ] / [Σ_{n=1..N} P(qk(old)|xn, Θ(old))]
   o Update the priors
     P(qk(new)|Θ(new)) = (1/N) Σ_{n=1..N} P(qk(old)|xn, Θ(old))

   In this algorithm all the data points participate in the update of all the models; however, their effect is weighted by the value of P(qk(old)|xn, Θ(old)).

3. Go to 1.

Until the total likelihood increase for the training data falls under some desired threshold.
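A sketch of one estimation/maximization pass in NumPy, for illustration only; the EMALGO tool below is the one actually used in the experiment, and em_step with its interface is an assumption of this sketch:

```python
import numpy as np

def em_step(data, means, covs, priors):
    """One EM iteration for a Gaussian mixture (soft assignments)."""
    N, d = data.shape
    K = len(means)
    # E-step: posterior P(q_k | x_n, Theta) for every point and class
    post = np.empty((N, K))
    for k in range(K):
        diff = data - means[k]
        inv = np.linalg.inv(covs[k])
        quad = np.einsum('ni,ij,nj->n', diff, inv, diff)
        norm = 1.0 / np.sqrt((2 * np.pi) ** d * np.linalg.det(covs[k]))
        post[:, k] = priors[k] * norm * np.exp(-0.5 * quad)
    post /= post.sum(axis=1, keepdims=True)   # normalize over the classes
    # M-step: every point contributes to every class, weighted by post
    for k in range(K):
        w = post[:, k]
        means[k] = (w[:, None] * data).sum(axis=0) / w.sum()
        diff = data - means[k]
        covs[k] = (w[:, None] * diff).T @ diff / w.sum()
        priors[k] = w.mean()
    return means, covs, priors, post
```

Iterating em_step until the total likelihood gain falls below a threshold reproduces the loop above; in practice the posteriors are computed in the log domain to avoid underflow.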

A global criterion is defined as follows:


ℑ(Θ) = log p(X|Θ) = log Σ_Q p(X, Q|Θ)
     = log Σ_Q P(Q|X, Θ) p(X|Θ)   (Bayes)
     = log Σ_{k=1..K} P(qk|X, Θ) p(X|Θ)

Applying Jensen's inequality:

log Σ_j λj yj ≥ Σ_j λj log yj   if Σ_j λj = 1,

the following is obtained:

ℑ(Θ) ≥ Σ_{k=1..K} P(qk|X, Θ) log p(X|Θ) = Σ_{k=1..K} Σ_{n=1..N} P(qk|xn, Θ) log p(xn|Θ)

Hence, the final criterion represents a lower bound for the joint likelihood of all the data with respect to all the models. This criterion is locally maximized by the algorithm.

Use of the EM explorer utility:

% EMALGO EM algorithm explorer
%
% Launch it with EMALGO(DATA,NCLUST) where DATA is the matrix
% of observations (one observation per row) and NCLUST is the
% desired number of clusters.
%
% The clusters are initialized with a heuristic that spreads
% them randomly around mean(DATA) with standard deviation
% sqrtm(cov(DATA)*10). Their initial covariance is set to cov(DATA).
%
% If you want to set your own initial clusters, use
% EMALGO(DATA,MEANS,VARS) where MEANS and VARS are cell arrays
% containing respectively NCLUST initial mean vectors and NCLUST
% initial covariance matrices. In this case, the initial a-priori
% probabilities are set equal to 1/NCLUST.
%
% To set your own initial priors, use EMALGO(DATA,MEANS,VARS,PRIORS)
% where PRIORS is a vector containing NCLUST a priori probabilities.
%
% Example: for two clusters


% means{1} = [1 2]; means{2} = [3 4];
% vars{1} = [2 0;0 2]; vars{2} = [1 0;0 1];
% emalgo(data,means,vars);
%

As in the previous problems, make use of the same data set allvow. Perform several runs with different initializations of the algorithm:

1. 5 clusters determined according to the default heuristic;
2. Some initial MEANS values equal to some selected data points, and some VARS initialized to random values (e.g. cov(allvow) for all the classes);
3. Initial MEANS and VARS values set by the K-means algorithm;
4. Some initial MEANS values equal to the means of {/a/, /e/, /i/, /o/, /y/}, some VARS values equal to the variances of {/a/, /e/, /i/, /o/, /y/}, and PRIORS values equal to [P/a/, P/e/, P/i/, P/o/, P/y/];
5. Some initial MEANS and VARS values at your discretion;
6. Increase the number of initial clusters and observe the behavior of the algorithm.

Run several iterations of the algorithm until asymptotic convergence is achieved. Observe the evolution of the clusters and of the total likelihood curve. In the EM case, the data-point attribution chart is not given, because each data point participates in the update of each cluster. Note the mean, variance and prior values found after the convergence of the algorithm. Compare them with the values found with Gaussian modeling in Problem 5.

Example:

>> emalgo(allvow, 5);
EMALGO: resulting means, variances and priors are now stored in the workspace variables em_result_means, em_result_vars and em_result_priors.
>> % or
>> means = {mu_a, mu_e, mu_i, mu_o, mu_y};
>> vars = {sigma_a, sigma_e, sigma_i, sigma_o, sigma_y};
>> emalgo(allvow, means, vars);
>> % To display the resulting means, vars and priors:
>> for k=1:5, disp(em_result_means{k}); end
>> for k=1:5, disp(em_result_vars{k}); end
>> for k=1:5, disp(em_result_priors(k)); end

1. Does the final solution depend on the initialization of the algorithm?
2. Describe the evolution of the total likelihood. Is it monotonic?
3. In terms of optimization of the likelihood, what does the final solution correspond to?
4. Is the algorithm suitable for fitting Gaussian clusters?


E. Example of Submitted Assignment

ECE 5526

Homework Assignment 2

Gaussian Statistics & Statistical Pattern Recognition

MITESHKUMAR PATEL

03/24/2017

Problem 1: Generate a sample vector X of N points, X = {x1, x2, x3, …, xN}, with N = 10000, generated by a Gaussian process that has mean µ = [730 1090] and covariance:

A. Σ1 = [8000 0; 0 8000]

B. Σ2 = [8000 0; 0 18500]

C. Σ3 = [8000 8400; 8400 18500]

Use the MATLAB function gausview to plot each Gaussian distribution as a scatter plot in a 2-dimensional plane view. This tool displays the data of a 2-dimensional function as 2-D and 3-D plots.

Answer:

A.
>> N = 10000;
>> mu = [730 1090];

>> sigma_1 = [8000 0; 0 8000];

>> X1 = randn(N,2) * sqrtm(sigma_1) + repmat(mu,N,1);

>> gausview(X1,mu,sigma_1,'Sample X1');

After computing these values we get the plot of the Gaussian distribution in 2-D and 3-D views.


Figure 1 P1_A

Figure 2 P1_A_3D

B.
>> N = 10000;
>> mu = [730 1090];

>> sigma_2 = [8000 0; 0 18500];

>> X2 = randn(N,2) * sqrtm(sigma_2) + repmat(mu,N,1);

>> gausview(X2,mu,sigma_2,'Sample X2');


After computing these values we get the plot of the Gaussian distribution in 2-D and 3-D views.

Figure 3 P1_B

Figure 4 P1_B_3D

C.
>> N = 10000;
>> mu = [730 1090];

>> sigma_3 = [8000 8400;8400 18500];

>> X3 = randn(N,2) * sqrtm(sigma_3) + repmat(mu,N,1);

>> gausview(X3,mu,sigma_3,'Sample X3');


After computing these values we get the plot of the Gaussian distribution in 2-D and 3-D views.

Figure 5 P1_C

Figure 6 P1_C_3D
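The sampling line used in parts A–C, randn(N,2)*sqrtm(sigma)+repmat(mu,N,1), colors unit-variance uncorrelated noise with a square root of the covariance matrix and shifts it by the mean. A NumPy sketch of the same idea (gauss_sample is a name invented here, and the Cholesky factor is used as another valid square root in place of MATLAB's symmetric sqrtm):

```python
import numpy as np

def gauss_sample(n, mu, sigma, rng=None):
    """Draw n samples from N(mu, sigma) by coloring standard normal noise.

    Any matrix A with A @ A.T == sigma works; the MATLAB example uses the
    symmetric square root (sqrtm), here the Cholesky factor is used instead.
    """
    rng = np.random.default_rng(rng)
    A = np.linalg.cholesky(np.asarray(sigma, dtype=float))
    z = rng.standard_normal((n, len(mu)))   # unit-variance, uncorrelated
    return z @ A.T + np.asarray(mu)

# Sample X3 of Problem 1, part C
X3 = gauss_sample(10000, [730, 1090], [[8000, 8400], [8400, 18500]], rng=0)
```

The sample mean and sample covariance of X3 should then be close to µ and Σ3, which is exactly what Problem 2 below measures.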

The shape of the pdf contours describes the covariance. The first sample has a diagonal covariance with equal variances, so the red level curves are circles, not just ellipses. The diagonal values of the 2x2 matrix correspond to the independent X and Y variances; when these are equal and the off-diagonal values are zero, the sample values vary equally in the X and Y directions. The second sample has a diagonal covariance with different variances: the contours are axis-aligned, non-circular ellipses (the variances differ, giving different radii). Increasing the Y variance widens the spread of the data in the Y direction, but the pdf contours remain symmetric about both axes. The third sample has a full covariance matrix with different variances and non-zero covariances: the red level curves are non-circular ellipses (different variances) that are not axis-aligned (non-zero covariance). The symmetry of the above pdfs follows from the values in each random process' covariance matrix. A process with a fully populated, non-zero matrix exhibits variation along both directions; when the two off-diagonal entries are equal, the variation appears along the X-Y diagonal, as shown in the figures.

Problem 2: The parameters of the third random process in Problem 1 were estimated for the following values of N; the first N values of the sample data were used in each estimate. The original values from the process were:

µ = [730 1090]

Ʃ3 = [8000 8400; 8400 18500]

MATLAB:

For 10000 points

>> X = X3(1:10000,:);

>> N = size(X,1);

>> mu_10000 = sum(X)/N;

>>sigma_10000 = (X - repmat(mu_10000, N, 1))' * (X - repmat(mu_10000, N, 1))/(N-1);

>> e_mu = sqrt((mu_10000-mu) * (mu_10000-mu)');

>> disp(sprintf('Difference of the means: %f', e_mu));

Difference of the means: 2.712609

>> e_sigma = norm( sigma_10000 - sigma_3 );

>> disp(sprintf('Difference of the covariance matrix: %f', e_sigma));

Difference of the covariance matrix: 232.969014

For 1000 points

>> X = X3(1:1000,:);

>> N = size(X,1);

>> mu_1000 = sum(X)/N;

>> sigma_1000 = (X - repmat(mu_1000, N, 1))' * (X - repmat(mu_1000, N, 1))/(N-1);

>> e_mu = sqrt((mu_1000-mu) * (mu_1000-mu)');

>> disp(sprintf('Difference of the means: %f', e_mu));


Difference of the means: 2.182316

>> e_sigma = norm( sigma_1000 - sigma_3 );

>> disp(sprintf('Difference of the covariance matrix: %f', e_sigma));

Difference of the covariance matrix: 1328.407388

For 100 points

>> X = X3(1:100,:);

>> N = size(X,1);

>> mu_100 = sum(X)/N;

>> sigma_100 = (X - repmat(mu_100, N, 1))' * (X - repmat(mu_100, N, 1))/(N-1);

>> e_mu = sqrt((mu_100-mu) * (mu_100-mu)');

>> disp(sprintf('Difference of the means: %f', e_mu));

Difference of the means: 5.704471

>> e_sigma = norm( sigma_100 - sigma_3 );

>> disp(sprintf('Difference of the covariance matrix: %f', e_sigma));

Difference of the covariance matrix: 1804.508400

From the MATLAB output we obtain the following results:

Points (N)   Est. Mean          Est. Covariance               Mean Distance   Covariance matrix norm
10000        [729.1 1087.0]     [7724 8113; 8113 18259]       3.1336          545.8393
1000         [730.6 1085.7]     [7282 7450; 7450 17229]       4.3337          1983.7
100          [729.8 1077.6]     [7294 8236; 8236 17103]       12.4494         1434.2
10           [725.1 1051.9]     [8241 13384; 13384 26446]     38.4163         10392

The most obvious and important trend is that as N decreases, the distance measures tend to rise. Having a greater amount of sample data available leads to estimated mean and covariance values that are much closer to the true values input to the process.
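This trend can be checked directly by re-estimating the parameters from nested prefixes of one synthetic sample; a hedged NumPy sketch (the variable names are invented here, and the exact error values depend on the random seed):

```python
import numpy as np

# One sample of the third process from Problem 1 (Cholesky coloring)
rng = np.random.default_rng(3)
mu = np.array([730.0, 1090.0])
sigma = np.array([[8000.0, 8400.0], [8400.0, 18500.0]])
X = rng.standard_normal((10000, 2)) @ np.linalg.cholesky(sigma).T + mu

errors = {}
for n in (100, 1000, 10000):
    Xn = X[:n]                               # first n points, as in the table
    mu_hat = Xn.mean(axis=0)
    sigma_hat = np.cov(Xn, rowvar=False)
    errors[n] = (np.linalg.norm(mu_hat - mu),           # distance of the means
                 np.linalg.norm(sigma_hat - sigma, 2))  # covariance matrix norm
```

With high probability both distances shrink roughly as 1/√N, matching the trend in the table above.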

Problem 3:


The likelihood of a sample point with respect to a Gaussian model Θ = (µ, Σ) is the probability density function of that point for the given model.

A. N1: µ = [730 1090], Σ = [8000 0; 0 8000]

Ans:
>> N = size(X3,1);
>> mu_1 = [730 1090]; sigma_1 = [8000 0; 0 8000];
>> logLike1 = 0;
>> for I = 1:N
>>   logLike1 = logLike1 + (X3(I,:) - mu_1)*inv(sigma_1)*(X3(I,:) - mu_1)';
>> end;
>> logLike1 = -0.5 * (logLike1 + N*log(det(sigma_1)) + 2*N*log(2*pi))
logLike1 =
  -1.2487e+005
>> gausview(X3,mu_1,sigma_1,'N1');

Figure 7 P3_A

B. N2: µ = [730 1090], Σ = [8000 0; 0 18500]

Ans:
>> N = size(X3,1);
>> mu_1 = [730 1090]; sigma_1 = [8000 0; 0 18500];
>> logLike1 = 0;
>> for I = 1:N
>>   logLike1 = logLike1 + (X3(I,:) - mu_1)*inv(sigma_1)*(X3(I,:) - mu_1)';
>> end;
>> logLike1 = -0.5 * (logLike1 + N*log(det(sigma_1)) + 2*N*log(2*pi))
>> gausview(X3,mu_1,sigma_1,'N2');
logLike1 =
  -1.2244e+005

Figure 8 P3_B

C. N3: µ = [730 1090], Σ = [8000 8400; 8400 18500]

Ans:
>> N = size(X3,1);
>> mu_1 = [730 1090]; sigma_1 = [8000 8400; 8400 18500];
>> logLike1 = 0;
>> for I = 1:N
>>   logLike1 = logLike1 + (X3(I,:) - mu_1)*inv(sigma_1)*(X3(I,:) - mu_1)';
>> end;
>> logLike1 = -0.5 * (logLike1 + N*log(det(sigma_1)) + 2*N*log(2*pi))
>> gausview(X3,mu_1,sigma_1,'N3');
logLike1 =
  -1.1908e+005


Figure 9 P3_C

D. N4: µ = [270 1690], Σ = [8000 8400; 8400 18500]

Ans:
>> N = size(X3,1);
>> mu_1 = [270 1690]; sigma_1 = [8000 8400; 8400 18500];
>> logLike1 = 0;
>> for I = 1:N
>>   logLike1 = logLike1 + (X3(I,:) - mu_1)*inv(sigma_1)*(X3(I,:) - mu_1)';
>> end;
>> logLike1 = -0.5 * (logLike1 + N*log(det(sigma_1)) + 2*N*log(2*pi))
>> gausview(X3,mu_1,sigma_1,'N4');
logLike1 =
  -8.5715e+005


Figure 10 P3_D

The computed joint log-likelihoods of X3 under the four models are:

Model            N1             N2             N3             N4
Log-likelihood   -1.2487e+005   -1.2244e+005   -1.1908e+005   -8.5715e+005

Computing the log-likelihood instead of the likelihood itself avoids the underflow problem associated with representing extremely small numbers in a computer. The table shows that N4 does the worst job of modeling X3, due to the drastic difference in mean value. N3 does the best job, because its model parameters match the input parameters of X3 most closely; however, it requires a fully populated covariance matrix to achieve this. N2 would be a good compromise, because only the diagonal of the covariance matrix is non-zero and its log-likelihood is close to that of N3. N3 and N4 have the highest number of parameters. If only four parameters were allowed, the choice would be between N1 and N2, of which N2 is more accurate (its log-likelihood is closer to zero). The log values are also preferred because they turn expensive multiplication operations into summations.
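The joint log-likelihood computed by the MATLAB loop above can be written in vectorized form; an illustrative NumPy version (joint_log_likelihood is a name invented for this sketch):

```python
import numpy as np

def joint_log_likelihood(X, mu, sigma):
    """Joint log-likelihood of all rows of X under N(mu, sigma).

    Summing per-point log densities replaces the product of densities,
    which would underflow to 0.0 after a few thousand points.
    """
    N, d = X.shape
    diff = X - mu
    inv = np.linalg.inv(sigma)
    quad = np.einsum('ni,ij,nj->n', diff, inv, diff)
    return -0.5 * (quad.sum() + N * np.log(np.linalg.det(sigma))
                   + N * d * np.log(2 * np.pi))
```

For even a moderate sample the value is so negative that exponentiating it back to a plain likelihood returns exactly 0.0 in double precision, which is the underflow the text refers to.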


Problem 4:

>> load vowels.mat; whos

Name      Size       Bytes     Class
a         5000x2     80000     double array
allvow    20000x2    320000    double array
e         6000x2     96000     double array
i         5000x2     80000     double array
o         3000x2     48000     double array
y         1000x2     16000     double array

Grand total is 80000 elements using 640000 bytes

Na = size(a,1); Ne = size(e, 1); Ni = size(i, 1); No = size(o,1); Ny = size(y,1);

N = Na + Ne + Ni + No + Ny;

Pa = Na/N

Name Size Bytes Class Attributes

I 1x1 8 double

N 1x1 8 double

X 1000x2 16000 double

X2 10000x2 160000 double

X3 10000x2 160000 double

a 5000x2 80000 double

allvow 20000x2 320000 double

ans 1x1 8 double

e 6000x2 96000 double

e_mu 1x1 8 double

i 5000x2 80000 double

logLike1 1x1 8 double


mu 1x2 16 double

mu_1 1x2 16 double

mu_1000 1x2 16 double

o 3000x2 48000 double

sigma_1 2x2 32 double

sigma_1000 2x2 32 double

sigma_2 2x2 32 double

sigma_3 2x2 32 double

y 1000x2 16000 double

>> Pa = Na/N

Pa = 0.2500

>> Pe = Ne/N

Pe = 0.3000

>> Pi = Ni/N

Pi = 0.2500

>> Po = No/N

Po = 0.1500

>> Py = Ny/N

Py = 0.0500

The training vowel data was loaded into MATLAB and the prior probability of each class was computed:

Vowel   a      e      i      o      y
P       0.25   0.30   0.25   0.15   0.05
Rank    2      1      2      3      4

The sum of all priors is 1, as expected. Given this training data set, we would expect the vowel "e" to be the most likely to occur and the "y" phoneme to be the least likely.
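The priors follow directly from the class sample counts; a minimal Python check, using the sizes reported by whos:

```python
# Illustrative class counts matching the vowels.mat training set
counts = {'a': 5000, 'e': 6000, 'i': 5000, 'o': 3000, 'y': 1000}
N = sum(counts.values())                       # 20000 training points
priors = {v: n / N for v, n in counts.items()} # P(q_k) = N_k / N
```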


Problem 5: Plot each vowel's data as clouds of points in the 2-D plane. Train the Gaussian models corresponding to each class (use the MATLAB commands mean and cov). Plot their contours.

>> plotvow;

>> mu_a = mean(a); sigma_a = cov(a);

>> plotgaus(mu_a, sigma_a, [0 1 1]);


Figure 11 contour A

>> mu_e = mean(e); sigma_e = cov(e);

>> plotgaus(mu_e, sigma_e, [1 0 1]);


Figure 12 contour 'e'

>> mu_i = mean(i); sigma_i = cov(i);

>> plotgaus(mu_i, sigma_i, [1 0 1]);

Figure 13 contour 'i'

>> mu_o = mean(o); sigma_o = cov(o);

>> plotgaus(mu_o, sigma_o, [1 0 1]);

Figure 14 contour 'o'


>> mu_y = mean(y); sigma_y = cov(y);

>> plotgaus(mu_y, sigma_y, [1 0 0]);

Figure 15 contour 'y'

After this we plot all contour in one figure,

Figure 16 contour "a,e,i,o,y"

>> mu_a = mean(a), sigma_a = cov(a)


mu_e = mean(e), sigma_e = cov(e)

mu_i = mean(i), sigma_i = cov(i)

mu_o = mean(o), sigma_o = cov(o)

mu_y = mean(y), sigma_y = cov(y)

mu_a =

1.0e+003 *

0.7295 1.0878

sigma_a =

1.0e+004 *

0.1632 0.5202

0.5202 5.2276

mu_e =

1.0e+003 *

0.5316 1.8449

sigma_e =

1.0e+004 *

1.4494 0.7282

0.7282 3.6447

mu_i =

1.0e+003 *

0.2708 2.2922

sigma_i =

1.0e+004 *

0.2603 0.1138

0.1138 3.5310

mu_o =

569.3860 842.6966

sigma_o =


1.0e+004 *

0.1977 0.3570

0.3570 2.0690

mu_y =

1.0e+003 *

0.4416 1.0229

sigma_y =

1.0e+004 *

0.7894 0.8770
0.8770 1.9333

Summarizing:

Vowel   Mu                 Sigma
a       [729.5 1087.8]     [1632 5202; 5202 52276]
e       [531.6 1844.9]     [14494 7282; 7282 36447]
i       [270.8 2292.2]     [2603 1138; 1138 35310]
o       [569.4 842.7]      [1977 3570; 3570 20690]
y       [441.6 1022.9]     [7894 8770; 8770 19333]

Problem 6 (Bayesian Classification):

A. In the case of Gaussian models for phoneme classes, what is the meaning of the parameter vector Θ given above?

ANS: The parameter vector Θ is the parameter set given for the classes.

B. What is the expression of p(X|qk, Θ) and log p(X|qk, Θ)?

ANS: The expression p(X|qk, Θ) is the probability of a data point belonging to a specific class, given all the data points; log p(X|qk, Θ) is the log of the same probability, which takes care of a possible underflow problem.

C. What is the definition of the probability P(qk|Θ)?

ANS: The expression P(qk|Θ) is the probability that a class is one of the classes in the data set.


D. Compute Gaussian pdf’s (means and variances) for each vowel class. Since the P(qk) was computed for each class for this fictitious language, and it is assumed that speech features are equi-probable. What is the most probable class qk for the speech feature points x=(F1, F2)T in the following table?

ANS:

>> gloglike([400, 1800], mu_a, sigma_a) + log(Pa);

gloglike([400, 1000], mu_a, sigma_a) + log(Pa);

gloglike([530, 1000], mu_a, sigma_a) + log(Pa);

gloglike([600, 1300], mu_a, sigma_a) + log(Pa);

gloglike([670, 1300], mu_a, sigma_a) + log(Pa);

gloglike([420, 2500], mu_a, sigma_a) + log(Pa);

>> gloglike([400, 1800], mu_e, sigma_e) + log(Pe);

gloglike([400, 1000], mu_e, sigma_e) + log(Pe);

gloglike([530, 1000], mu_e, sigma_e) + log(Pe);

gloglike([600, 1300], mu_e, sigma_e) + log(Pe);

gloglike([670, 1300], mu_e, sigma_e) + log(Pe);

gloglike([420, 2500], mu_e, sigma_e) + log(Pe);

>> gloglike([400, 1800], mu_i, sigma_i) + log(Pi)

gloglike([400, 1000], mu_i, sigma_i) + log(Pi)

gloglike([530, 1000], mu_i, sigma_i) + log(Pi)

gloglike([600, 1300], mu_i, sigma_i) + log(Pi)

gloglike([670, 1300], mu_i, sigma_i) + log(Pi)

gloglike([420, 2500], mu_i, sigma_i) + log(Pi)

>> gloglike([400, 1800], mu_o, sigma_o) + log(Po)

gloglike([400, 1000], mu_o, sigma_o) + log(Po)

gloglike([530, 1000], mu_o, sigma_o) + log(Po)


gloglike([600, 1300], mu_o, sigma_o) + log(Po)

gloglike([670, 1300], mu_o, sigma_o) + log(Po)

gloglike([420, 2500], mu_o, sigma_o) + log(Po)

>> gloglike([400, 1800], mu_y, sigma_y) + log(Py)

gloglike([400, 1000], mu_y, sigma_y) + log(Py)

gloglike([530, 1000], mu_y, sigma_y) + log(Py)

gloglike([600, 1300], mu_y, sigma_y) + log(Py)

gloglike([670, 1300], mu_y, sigma_y) + log(Py)

gloglike([420, 2500], mu_y, sigma_y) + log(Py)

x    F1    F2    log P(q/a/|x)   log P(q/e/|x)   log P(q/i/|x)   log P(q/o/|x)   log P(q/y/|x)   Most probable class

1. 400 1800 -88.9257 -13.6356 -19.9132 -75.5643 -49.3573 e

2. 400 1000 -58.3930 -22.8786 -41.7136 -27.0938 -14.0427 y

3. 530 1000 -28.5610 -23.8978 -53.6589 -14.5352 -15.1644 o

4. 600 1300 -22.7700 -18.3107 -51.7360 -18.2221 -16.0276 y

5. 670 1300 -15.5094 -19.4504 -62.5418 -17.5389 -17.2378 a

6. 420 2500 -122.0943 -21.1723 -16.9512 -148.2887 -131.4180 i
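The table is filled by evaluating gloglike(x, mu, sigma) + log(P) for every class and keeping the argmax; a NumPy sketch of that decision rule (classify and the explicit Gaussian log-density stand in for the lab's gloglike helper):

```python
import numpy as np

def classify(x, models, priors):
    """Pick the class maximizing log p(x | mu, sigma) + log P(q).

    models maps a class name to (mu, sigma); priors maps a name to P(q).
    """
    best, best_score = None, -np.inf
    for name, (mu, sigma) in models.items():
        diff = np.asarray(x, dtype=float) - mu
        inv = np.linalg.inv(sigma)
        # Gaussian log-density plus log prior (unnormalized log posterior)
        score = (-0.5 * (diff @ inv @ diff
                         + np.log(np.linalg.det(sigma))
                         + len(x) * np.log(2 * np.pi))
                 + np.log(priors[name]))
        if score > best_score:
            best, best_score = name, score
    return best
```

Because the evidence p(x) is the same for every class, it can be dropped from the comparison, which is why the table compares unnormalized log posteriors directly.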

Problem 7 (Discriminant Surfaces):

A. What is the relationship between Bayesian classifiers and discriminant functions?

ANS: Discriminant functions assume the number and types of classes are known; Bayesian classifiers are not limited in this way and compute what they find to be the optimal classification.


B. The iso-likelihood lines for the Gaussian pdfs N(m/i/, S/i/) and N(m/e/, S/e/) are presented in the figures below; in the first figure the two models have different covariance matrices, while in the second the same covariance matrix is used for both (e.g., N(m/i/, S/e/) and N(m/e/, S/e/)). Use a colored pen to join the intersections of the level lines that correspond to equal likelihoods.

ANS:


C. What is the form of the surface that separates class /i/ from class /e/ when the two models have different variances? Can you explain the origin of this form? What is the surface that separates class /i/ from class /e/ when the two models have the same variances? Why is it different from the previous discriminant surface? Use a mathematical derivation of the discriminant functions to prove your claims.

ANS: The first graph has a quadratic discriminating curve due to the different curvatures of the ellipses resulting from the class data. The second graph, with equal variances, has a linear discriminating line because the two families of ellipses curve at the same rate.
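The answer can be backed by a short derivation (a standard argument, added here for completeness). Taking logs of the Gaussian class likelihoods, the Bayesian discriminant function for class k is

```latex
g_k(x) = -\tfrac{1}{2}\,(x-\mu_k)^T \Sigma_k^{-1} (x-\mu_k)
         - \tfrac{1}{2}\,\log\lvert\Sigma_k\rvert + \log P(q_k)
```

The boundary between /i/ and /e/ is the set where g_i(x) = g_e(x). When Σ_i ≠ Σ_e, the quadratic terms xᵀΣ_i⁻¹x and xᵀΣ_e⁻¹x do not cancel, so the boundary is a quadric. When Σ_i = Σ_e = Σ, they cancel and the boundary reduces to the linear equation

```latex
(\mu_i - \mu_e)^T \Sigma^{-1} x
  = \tfrac{1}{2}\left(\mu_i^T \Sigma^{-1} \mu_i - \mu_e^T \Sigma^{-1} \mu_e\right)
    - \log\frac{P(q_i)}{P(q_e)}
```

which is a straight line in the (F1, F2) plane.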

Problem 8 (Unsupervised Training):

A. K-Means

>> kmeans(allvow,5);

>> for k=1:5, disp(kmeans_result_means{k}); end

587.1563 837.3198

1.0e+003 *

0.5137 1.9925

1.0e+003 *

0.2729 2.3376

1.0e+003 *

0.7080 1.1771

1.0e+003 *

0.5147 1.6901

Default Heuristic with Five clusters:

Initial Mean values for clusters equal to some random data points from allvow

Initial mean values set to mean values found in problem (5):

1. The final solution of the k-means algorithm run does not depend on the initial state: with each initialization used above, the resulting solution was the same. The number of iterations required did vary, however.

2. The accumulated distance J is minimized over the course of the run. There are periods during the algorithm in which the distance is reduced more quickly; it is, however, a monotonically decreasing function.

3. For the Euclidean distance metric, the discriminant surface between two clusters is a straight line: the perpendicular bisector of the segment joining their mean values. Its slope is determined by the positions of the cluster centers relative to one another.

4. This algorithm is not as well suited to fitting Gaussian mixture clusters as the later ones. One reason is that no covariance is estimated for each class; this leads to hard decision boundaries that do not fit Gaussian models with strongly asymmetric covariance matrices.
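The behavior described in points 1 through 3 can be illustrated with a minimal from-scratch k-means sketch. This hypothetical Python implementation is not MATLAB's kmeans tool used above, and the two synthetic blobs below stand in for the real allvow formant data:

```python
import random

def kmeans_sketch(points, k, iters=20, seed=0):
    """Minimal 2-D k-means; returns the final means and the history of J."""
    rng = random.Random(seed)
    means = rng.sample(points, k)  # initial means = random data points
    history = []
    for _ in range(iters):
        # Assignment step: hard decision -- the nearest mean wins.
        clusters = [[] for _ in range(k)]
        J = 0.0
        for x, y in points:
            d, idx = min(((x - mx) ** 2 + (y - my) ** 2, i)
                         for i, (mx, my) in enumerate(means))
            clusters[idx].append((x, y))
            J += d
        history.append(J)
        # Update step: each mean moves to the centroid of its cluster,
        # which can only decrease J further -- hence the monotone descent.
        for i, c in enumerate(clusters):
            if c:
                means[i] = (sum(p[0] for p in c) / len(c),
                            sum(p[1] for p in c) / len(c))
    return means, history

# Two synthetic (F1, F2)-like blobs standing in for the real vowel data.
rng = random.Random(1)
pts = ([(rng.gauss(500, 30), rng.gauss(1200, 60)) for _ in range(100)]
       + [(rng.gauss(300, 30), rng.gauss(2300, 60)) for _ in range(100)])
means, history = kmeans_sketch(pts, 2)
```

Plotting `history` reproduces the monotonically decreasing J curve noted in point 2; note also that only means are updated, which is the limitation raised in point 4.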

B. Viterbi-EM

>> viterb(allvow,5);

>> for k=1:5, disp(viterb_result_means{k}); end

>> for k=1:5, disp(viterb_result_vars{k}); end

>> for k=1:5, disp(viterb_result_priors(k)); end

538.9211 881.1274

1.0e+003 *
0.5332 1.8441

1.0e+003 *
0.2708 2.2920

1.0e+003 *
0.7316 1.0971

1.0e+003 *
0.1238 2.0389

1.0e+004 *
0.6458 -0.0269
-0.0269 2.6115

1.0e+004 *
1.4030 0.8015
0.8015 3.5702

1.0e+004 *
0.2529 0.1273
0.1273 3.4751

1.0e+004 *
0.1415 0.4371
0.4371 4.9503

1.0e+003 *
0.4179 -1.8479
-1.8479 8.6449

0.2045
0.2977
0.2513
0.2460
4.5000e-004

Five default clusters:

Initial Mean values for clusters equal to sample data values.

Initial Cov values for clusters equal to cov(allvow).

Initial means, vars, and priors found in k-means

Values from the model in problem (5) and a priori probabilities of all classes from problem (3)

1. Yes, in all cases the resulting class characteristics were different. This is because the Bayesian classification assigns each point to its most probable cluster and then updates the means and variances of the classes in use.

2. The overall likelihood metric is a monotonically increasing function. When the initial class parameters closely match the actual parameters, the likelihood rises sharply over the first iterations. In some cases, such as when the final parameters from K-Means were used, there were two periods of rapid growth. In all cases, convergence is approached when the likelihood ceases to grow and levels off.

3. The likelihood function measures the likelihood of the data given their respective models. When this function is optimized to the final solution, the total probability of each datum being classified correctly is maximized.

4. Because the covariance of each model is allowed to vary, the discriminant surface will be some type of hyperbolic or parabolic surface.

5. This algorithm is suitable for Gaussian mixture class modeling because it adapts and optimizes the covariance of each class.
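The covariance adaptation in point 5 is what separates Viterbi-EM from k-means. One hard-assignment (Viterbi) iteration can be sketched in 1-D as follows; this hypothetical Python code is not the course's viterb tool, and the data are synthetic:

```python
import math
import random

def viterbi_em_step(data, means, variances, priors):
    """One hard-assignment (Viterbi) EM iteration for a 1-D Gaussian mixture."""
    k = len(means)
    # Classification step: each point goes to the class with the highest
    # log posterior (full Gaussian log-likelihood plus log prior).
    clusters = [[] for _ in range(k)]
    for x in data:
        scores = [math.log(priors[i])
                  - 0.5 * math.log(2 * math.pi * variances[i])
                  - (x - means[i]) ** 2 / (2 * variances[i])
                  for i in range(k)]
        clusters[scores.index(max(scores))].append(x)
    # Update step: unlike k-means, the variance and prior of every class
    # are re-estimated too, not just the mean.
    for i, c in enumerate(clusters):
        if len(c) > 1:
            m = sum(c) / len(c)
            means[i] = m
            variances[i] = sum((x - m) ** 2 for x in c) / len(c)
            priors[i] = len(c) / len(data)
    return means, variances, priors

rng = random.Random(0)
data = ([rng.gauss(0, 1) for _ in range(200)]
        + [rng.gauss(6, 2) for _ in range(200)])
means, variances, priors = [-1.0, 5.0], [1.0, 1.0], [0.5, 0.5]
for _ in range(10):
    means, variances, priors = viterbi_em_step(data, means, variances, priors)
```

Because the two underlying classes here have different spreads, the fitted variances end up different as well, which k-means could never capture.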

C. EM-Algorithm

>> emalgo(allvow, 5);

The following simulations were run until there were no visual changes in the model parameters. This was necessary because this algorithm is run “until the likelihood increase falls under some desired threshold” or until “asymptotic convergence” is reached.

Five default clusters:

Initial Mean values for clusters equal to sample data values.

Initial Cov values for clusters equal to cov(allvow).

Initial means, vars, and priors found in k-means

Values from the model in problem (5)

Ten clusters used:

1. Yes, the final class distribution was different in each simulation. This is because different a priori probabilities can be specified, as well as different model parameters, which affect the estimation step of each iteration. In addition, all data points participate in every model update (if only slightly, due to their posterior weighting).

2. As with the other algorithms, the global criterion for the EM-algorithm is a monotonically increasing function. However, the log-likelihood for this algorithm tended not to plateau over extended iterations (this was most visible with the K-Means initialization). Convergence tended to occur much more quickly with this approach.

3. This algorithm is suitable for Gaussian mixture class modeling because, like the Viterbi-EM algorithm, it adapts and optimizes the covariance of each class.
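The properties noted above — every datum contributing to every class with a posterior weight, and a monotonically increasing log-likelihood — can be seen in a minimal soft-EM sketch. Again this is hypothetical Python, not the course's emalgo tool, run on synthetic 1-D data:

```python
import math
import random

def em_step(data, means, variances, priors):
    """One soft EM iteration for a 1-D Gaussian mixture; also returns the
    data log-likelihood under the parameters used in the E-step."""
    k = len(means)
    loglik = 0.0
    responsibilities = []
    # E-step: every point gets a posterior weight for every class, so all
    # data participate in all updates (if only slightly).
    for x in data:
        p = [priors[i] / math.sqrt(2 * math.pi * variances[i])
             * math.exp(-(x - means[i]) ** 2 / (2 * variances[i]))
             for i in range(k)]
        s = sum(p)
        loglik += math.log(s)
        responsibilities.append([v / s for v in p])
    # M-step: weighted re-estimation of every mean, variance, and prior.
    for i in range(k):
        w = sum(r[i] for r in responsibilities)
        means[i] = sum(r[i] * x for r, x in zip(responsibilities, data)) / w
        variances[i] = sum(r[i] * (x - means[i]) ** 2
                           for r, x in zip(responsibilities, data)) / w
        priors[i] = w / len(data)
    return means, variances, priors, loglik

rng = random.Random(0)
data = ([rng.gauss(0, 1) for _ in range(150)]
        + [rng.gauss(5, 1.5) for _ in range(150)])
means, variances, priors = [-1.0, 4.0], [1.0, 1.0], [0.5, 0.5]
logliks = []
for _ in range(15):
    means, variances, priors, ll = em_step(data, means, variances, priors)
    logliks.append(ll)
```

Plotting `logliks` reproduces the non-decreasing global criterion described in point 2 of the discussion above.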
