Automatic subtitle generation
MAJOR PROJECT
Supervisor: K. Rajalakshmi
Submitted by: Tanya Saxena (10503894), Abhinav Mathur (10503858)
Automatic Subtitle Generation from Videos
Video has become one of the most popular multimedia artefacts used on PCs and the Internet. In a majority of videos, the sound holds an important place. It is therefore essential to make sound videos understandable for people with auditory problems, as well as for people with gaps in the spoken language. The most natural way to do this is through subtitles.
However, manual subtitle creation is a long and tedious activity that requires the constant presence of the user. Consequently, automatic subtitle generation is a valid subject of research.
PROBLEM STATEMENT...
The system should take a video file as input and generate a subtitle file (.srt/.txt) as output. The three modules are:
Audio Extraction:
The audio extraction routine is expected to return a suitable audio format that can be used by the speech recognition module as pertinent material. It must handle a defined list of video and audio formats, and it has to verify the input file so that it can evaluate the feasibility of extraction. The audio track has to be returned in the most reliable format.
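A minimal sketch of the input-verification step described above; the supported-format list follows the functional requirements later in this document, and the class and extension list are illustrative assumptions, not the project's actual code:

```java
import java.util.Set;

public class ExtractionCheck {
    // Formats the extractor accepts (illustrative list, per the
    // MPEG-standard requirement stated in this document).
    static final Set<String> SUPPORTED =
            Set.of("mp2", "mp3", "mp4", "avi", "wav", "au", "aac", "flac");

    // Verify the input file so extraction feasibility can be evaluated.
    static boolean canExtract(String filename) {
        int dot = filename.lastIndexOf('.');
        if (dot < 0) return false;          // no extension at all
        String ext = filename.substring(dot + 1).toLowerCase();
        return SUPPORTED.contains(ext);
    }

    public static void main(String[] args) {
        System.out.println(canExtract("File.mp4")); // true
        System.out.println(canExtract("File.wma")); // false: not MPEG-standard
    }
}
```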
Speech Recognition:
The speech recognition routine is the key part of the system: it directly affects performance and the evaluation of results. First, it must determine the type of the input file; if the type is provided, an appropriate processing method is chosen, otherwise the routine uses a default configuration. It must be able to recognize silences so that text delimitations can be established.
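The silence-recognition step can be sketched as a simple energy threshold over frames of samples. The threshold value and frame contents below are illustrative assumptions; real recognizers use more robust endpointing:

```java
public class SilenceDetector {
    // A frame counts as silence when its mean absolute amplitude falls
    // below a fixed threshold (illustrative value, samples in [-1, 1]).
    static final double THRESHOLD = 0.01;

    static boolean isSilent(double[] frame) {
        double sum = 0;
        for (double s : frame) sum += Math.abs(s);
        return (sum / frame.length) < THRESHOLD;
    }

    public static void main(String[] args) {
        double[] quiet = {0.001, -0.002, 0.0005, 0.0};
        double[] speech = {0.2, -0.3, 0.25, -0.1};
        System.out.println(isSilent(quiet));  // true
        System.out.println(isSilent(speech)); // false
    }
}
```

Runs of silent frames mark the boundaries between utterances, which become the text delimitations mentioned above.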
Subtitle Generation:
The subtitle generation routine creates and writes a file containing multiple chunks of text, corresponding to utterances delimited by silences, together with their respective start and end times. Time synchronization considerations are of main importance.
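One entry of the output file can be sketched as below; the timecode layout (HH:MM:SS,mmm) is the standard SRT format, while the class and method names are illustrative:

```java
public class SrtWriter {
    // Format milliseconds as the SRT timecode HH:MM:SS,mmm.
    static String timecode(long ms) {
        long h = ms / 3_600_000, m = (ms / 60_000) % 60,
             s = (ms / 1000) % 60, frac = ms % 1000;
        return String.format("%02d:%02d:%02d,%03d", h, m, s, frac);
    }

    // One numbered SRT cue: index, time range, text, then a blank line.
    static String cue(int index, long startMs, long endMs, String text) {
        return index + "\n" + timecode(startMs) + " --> " + timecode(endMs)
                + "\n" + text + "\n\n";
    }

    public static void main(String[] args) {
        // An utterance recognized between 0.5 s and 2.75 s:
        System.out.print(cue(1, 500, 2750, "Hello world"));
    }
}
```

Each utterance delimited by silence becomes one such cue, which is why accurate start/end times from the recognizer are critical for synchronization.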
BENEFITS OF USING SUBTITLES....
The major benefit is that viewers do not need to download subtitles from the Internet in order to watch a video with subtitles.
Captions help children with word identification, meaning, acquisition, and retention.
Captions can help children establish a systematic link between the written word and the spoken word.
Captioning has been linked to higher comprehension skills compared with watching the same media without captions.
Captions provide missing information for individuals who have difficulty processing the speech and auditory components of visual media (whether or not this difficulty is due to hearing loss).
Captioning is essential for children who are deaf and hard of hearing, can be very beneficial to those learning English as a second language, can help those with reading and literacy problems, and can help those who are learning to read.
OVERALL ARCHITECTURE
OFTHE PROJECT
#1
FLOW DIAGRAM
#2
USE CASE DIAGRAM
#3
ACTIVITY DIAGRAM
AUDIO EXTRACTION…
SPEECH RECOGNITION…
SUBTITLE GENERATION…
TECHNOLOGY &
TOOLS USED
FFMPEG…
The FFMPEG libraries are used to carry out most of our multimedia tasks quickly and easily: audio compression, audio/video format conversion, extracting images from a video, and a lot more. They can be used by developers for transcoding, streaming and playing.
It is a very stable framework for transcoding video and audio.
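As a sketch of how FFmpeg is typically driven for this project's extraction step, the snippet below assembles the command line that strips the video stream and encodes the audio as MP3. The file names are illustrative, and invoking the external `ffmpeg` binary this way is an assumption about the integration, not the project's confirmed code:

```java
import java.util.List;

public class AudioExtract {
    // Build an ffmpeg command that extracts the audio track from a
    // video file and re-encodes it as MP3.
    static List<String> buildCommand(String videoIn, String audioOut) {
        return List.of(
            "ffmpeg",
            "-i", videoIn,              // input video file
            "-vn",                      // drop the video stream
            "-acodec", "libmp3lame",    // encode the audio as MP3
            audioOut);
    }

    public static void main(String[] args) {
        // The command could be run with ProcessBuilder; here we just print it.
        System.out.println(String.join(" ", buildCommand("File.mp4", "File.mp3")));
    }
}
```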
JAVA SPEECH API…
It allows developers to incorporate speech technology into user interfaces for their Java applets and applications. The API specifies a cross-platform interface to support command-and-control recognizers, dictation systems and speech synthesizers. Sun has also developed JSGF (the Java Speech Grammar Format) to provide a cross-platform grammar format for speech recognizers.
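As an illustration of the format, a minimal JSGF grammar has this shape (the grammar below is a made-up example, not taken from the project):

```jsgf
#JSGF V1.0;
grammar commands;
public <command> = play | pause | stop;
```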
CURRENT PROBLEMS…
Robustness.
Automatic generation of word lexicons.
Finding the theoretical limit for FSM implementations of ASR systems.
Optimal utterance verification-rejection algorithms.
Accuracy and Word Error Rate.
Filling up missing offset samples with silence.
Synchronization between tracks.
FUNCTIONAL REQUIREMENTS
All MPEG standard formats, such as MP2, MP3 etc., are supported for audio/video.
Audio of any format can be extracted, but speech recognition is done only in English.
The text extracted from the audio/video is in the .srt format, and the displayed text has a readable format.
Captions appear on-screen long enough to be read; it is preferable to limit on-screen captions to no more than two lines. Captions are synchronized with the spoken words.
The user can convert the extracted audio into any suitable format supported under the MPEG standards.
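The two-line caption requirement above can be sketched as a greedy word-wrap that rejects captions needing more than two lines (the line width and class name are illustrative assumptions):

```java
import java.util.ArrayList;
import java.util.List;

public class CaptionWrap {
    static final int MAX_CHARS_PER_LINE = 32; // illustrative width

    // Greedily wrap words onto lines; return null when the caption would
    // need more than two lines and should be split across cues instead.
    static List<String> wrapToTwoLines(String caption) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (String word : caption.split("\\s+")) {
            if (line.length() > 0
                    && line.length() + 1 + word.length() > MAX_CHARS_PER_LINE) {
                lines.add(line.toString());
                line = new StringBuilder();
            }
            if (line.length() > 0) line.append(' ');
            line.append(word);
        }
        if (line.length() > 0) lines.add(line.toString());
        return lines.size() <= 2 ? lines : null;
    }

    public static void main(String[] args) {
        System.out.println(wrapToTwoLines("the quick brown fox jumps over the lazy dog"));
    }
}
```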
NON-FUNCTIONAL
REQUIREMENTS
System Requirements – The software is compatible on all the Operating Systems. The user needs to install the .exe file of the software in their PCs.
Security – The system has no security constraints. Performance – The text is synchronized with the song. Maintainability – The software is easy to maintain. Reliability - The software will provide a good level of
precision. Modifiability- The software cannot be modified by
external user. Scalability- The software is scalable as a number of
users can utilize it for their benefits simultaneously.
PROPOSED ALGORITHMS
MP3 ALGORITHM…
1. Initialize i = 0, j = 1.
2. tincr = 1.0 / sample_rate.
3. dstp = dst; c = 2 * M_PI * 440.0.
4. Generate a sine tone with a 440 Hz frequency and duplicated channels.
5. Check if i < nb_samples; if true, generate the sine wave and store it: *dstp = sin(c * *t).
6. Check if j < nb_channels.
7. Store the packets in the destination buffer.
8. Increment dstp += nb_channels and t += tincr.
9. Repeat until the dst buffer is filled with nb_samples, generated starting from t.
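The steps above can be sketched in Java as follows. The variable names mirror the pseudocode (nb_samples, nb_channels, tincr), while the class itself and the interleaved buffer layout are illustrative assumptions:

```java
public class SineTone {
    // Fill an interleaved buffer with a 440 Hz sine wave, duplicating
    // each sample across all channels, as in the steps above.
    static double[] generate(int nbSamples, int nbChannels, int sampleRate) {
        double[] dst = new double[nbSamples * nbChannels];
        double tincr = 1.0 / sampleRate;    // time step per sample
        double c = 2 * Math.PI * 440.0;     // angular frequency of 440 Hz
        double t = 0;
        int dstp = 0;
        for (int i = 0; i < nbSamples; i++) {
            double v = Math.sin(c * t);
            for (int j = 0; j < nbChannels; j++) {
                dst[dstp + j] = v;          // duplicate across channels
            }
            dstp += nbChannels;
            t += tincr;
        }
        return dst;
    }

    public static void main(String[] args) {
        double[] buf = generate(4, 2, 44100);
        System.out.println(buf.length); // 8: four samples, two channels each
    }
}
```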
MFCC (MEL FREQUENCY CEPSTRAL COEFFICIENT)
Check whether the delta frequency (the ratio between the sample rate and the number of FFT points) is zero:

    if (deltaFreq == 0) {
        throw new IllegalArgumentException("deltaFreq has zero value");
    }

Check whether the left and right boundaries of the filter are too close:

    if ((Math.round(rightEdge - leftEdge) == 0)
            || (Math.round(centerFreq - leftEdge) == 0)
            || (Math.round(rightEdge - centerFreq) == 0)) {
        throw new IllegalArgumentException("Filter boundaries too close");
    }

Find how many frequency bins fit in the current frequency range, and initialize the weight field:

    numberElementsWeightField =
            (int) Math.round((rightEdge - leftEdge) / deltaFreq + 1);
    if (numberElementsWeightField == 0) {
        throw new IllegalArgumentException("Number of elements in mel is zero.");
    }
    weight = new double[numberElementsWeightField];
CONTINUED…
Compute the filter height and the slopes based on it:

    filterHeight = 2.0f / (rightEdge - leftEdge);
    leftSlope = filterHeight / (centerFreq - leftEdge);
    rightSlope = filterHeight / (centerFreq - rightEdge);

Now compute the weight for each frequency bin:

    for (currentFreq = initialFreq, indexFilterWeight = 0;
            currentFreq <= rightEdge;
            currentFreq += deltaFreq, indexFilterWeight++) {
        if (currentFreq < centerFreq) {
            weight[indexFilterWeight] = leftSlope * (currentFreq - leftEdge);
        } else {
            weight[indexFilterWeight] =
                    filterHeight + rightSlope * (currentFreq - centerFreq);
        }
    }

Convert linear frequency to mel frequency:

    private double linToMelFreq(double inputFreq) {
        return (2595.0 * (Math.log(1.0 + inputFreq / 700.0) / Math.log(10.0)));
    }
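The fragments above can be assembled into a self-contained sketch of one triangular mel filter. The class name and the example edge frequencies are illustrative; the formulas follow the snippets:

```java
public class MelFilterSketch {
    // Linear-to-mel conversion, as in the snippet above.
    static double linToMelFreq(double f) {
        return 2595.0 * (Math.log(1.0 + f / 700.0) / Math.log(10.0));
    }

    // Triangular filter weights: a rising slope from leftEdge up to
    // centerFreq, then a falling slope down to rightEdge (all in Hz).
    static double[] weights(double leftEdge, double centerFreq,
                            double rightEdge, double deltaFreq) {
        int n = (int) Math.round((rightEdge - leftEdge) / deltaFreq + 1);
        double filterHeight = 2.0 / (rightEdge - leftEdge);
        double leftSlope = filterHeight / (centerFreq - leftEdge);
        double rightSlope = filterHeight / (centerFreq - rightEdge);
        double[] w = new double[n];
        double f = leftEdge;
        for (int i = 0; i < n; i++, f += deltaFreq) {
            w[i] = (f < centerFreq)
                    ? leftSlope * (f - leftEdge)
                    : filterHeight + rightSlope * (f - centerFreq);
        }
        return w;
    }

    public static void main(String[] args) {
        // A filter from 100 Hz to 300 Hz centered at 200 Hz, 50 Hz bins:
        for (double v : weights(100, 200, 300, 50)) {
            System.out.printf("%.3f ", v); // 0.000 0.005 0.010 0.005 0.000
        }
    }
}
```

The weight is zero at both edges and peaks at the center frequency, which is the expected triangular shape.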
IMPLEMENTATION
#1
AUDIO EXTRACTION
#2
SPEECH RECOGNITION
#3
SUBTITLE GENERATION
RISK AND ITS IMPACT
Risk ID | Classification | Description of Risk | Risk Area | Probability | Impact | RE (P*I)
1 | Product Engineering | Word Error Rate | Performance | L | H | M
2 | Product Engineering | Aliasing | Performance | M | M | M
3 | Development Environment | Bitrate of extracted audio more than that of input audio | Testing Environment | L | L | L
4 | Product Engineering | Accuracy and Speed | Performance | L | H | M
5 | Program Constraint | Format not recognized | External Input | L | H | M
RISK AND MITIGATION
PLANS
Risk ID | Description of Risk | Risk Area | Mitigation
1 | Word Error Rate | Performance | Having an efficient database (training set).
2 | Aliasing | Performance | Resampling the samples at a fixed frequency.
3 | Bitrate of extracted audio more than that of input audio | Testing Environment | Encode and decode audio at the bitrate of the input audio.
4 | Accuracy and Speed | Performance | Synchronization.
5 | Format not recognized | External Input | Input audio/video supported by MPEG standard formats.
BLACK BOX TESTING
Test Case ID | Input | Expected Output | Status
1.1 | File.mp3 | File.mp3 | Pass
1.2 | File.mp4 | File.mp3 | Pass
1.3 | File.mp2 | File.mp3 | Pass
1.4 | File.au | File.au | Pass
1.5 | File.aac | File.aac | Pass
1.6 | File.wav | File.wav | Pass
1.7 | File.flac | File.flac | Pass
1.8 | File.wma (format not supported by MPEG standards) | File.wma | Fail
1.9 | File.als (format not supported by MPEG standards) | File.als | Fail
2.1 | File.wav (words present in the dictionary) | Speech recognized, text printed | Pass
2.2 | File.mp3 (not a .wav file) | Speech recognized, text printed | Fail
2.3 | File.au (not a .wav file) | Speech recognized, text printed | Fail
2.4 | File.flac (not a .wav file) | Speech recognized, text printed | Fail
2.5 | File.wav (words not found in the dictionary) | Speech recognized, text printed | Fail
3.1 | File.srt (incorrect timecode) | Subtitles generated and synchronized with the video | Fail
3.2 | File.srt (correct timecode), File.avi | Subtitles generated and synchronized with the video file File.avi | Pass
3.3 | File.txt (not containing the timecode) | Subtitles generated and synchronized with the video | Fail
3.4 | File.srt (correct timecode), File.mp4 | Subtitles generated and synchronized with the video file File.mp4 | Pass
3.5 | File.srt (correct timecode), File.wma | Subtitles generated and synchronized with the video file File.wma | Pass
WHITE BOX TESTING
AUDIO EXTRACTION…
CC = E - N + 2, where E = number of edges (80) and N = number of nodes (72).
CC = 80 - 72 + 2 = 10
CYCLOMATIC COMPLEXITY…
SPEECH RECOGNITION…
CC = E - N + 2, where E = number of edges (98) and N = number of nodes (91).
CC = 98 - 91 + 2 = 9
CYCLOMATIC COMPLEXITY…
ERROR & EXCEPTION
HANDLING
Test Case ID | Component | Debugging Technique
1.8 | Audio Extraction | Backtracking debugging
1.9 | Audio Extraction | Backtracking debugging
2.2 | Speech Recognition | Backtracking debugging
2.3 | Speech Recognition | Backtracking debugging
2.4 | Speech Recognition | Backtracking debugging
2.5 | Speech Recognition | Print debugging
3.1 | Subtitles Generation | Print debugging
3.3 | Subtitles Generation | Backtracking debugging
Test Case ID | Input | Expected Output | Status
1.8 | File.au (format supported by MPEG standards) | File.au | Pass
1.9 | File.mp4 (format supported by MPEG standards) | File.mp3 | Pass
2.2 | File.wav | Speech recognized, text printed | Pass
2.3 | File.wav | Speech recognized, text printed | Pass
2.4 | File.wav | Speech recognized, text printed | Pass
2.5 | File.wav (words found in the dictionary) | Speech recognized, text printed | Pass
3.1 | File.srt (correct timecode) | Subtitles generated and synchronized with the video | Pass
3.3 | File.srt | Subtitles generated and synchronized with the video | Pass
RESEARCH WORK
DETAILED STUDY OF INPUT AND EXTRACTED FILES…
S. No. | Input File | Size Before (MB) | Bitrate Before (kbps) | Size After (MB) | Bitrate After (kbps) | Length of File (min:sec) | Time for Extraction (ms) | Reduction Rate
1 | Despicable.avi | 10.8 | 1628 | 8.24 | 1411 | 00:49 | 0.6 | 24%
2 | Time.mp4 | 48.1 | 1663 | 44.4 | 1536 | 04:02 | 3.12 | 8%
3 | Florida.mp4 | 76 | 2723 | 39.3 | 1411 | 03:54 | 1.08 | 48%
4 | International.mp4 | 79.1 | 2673 | 41.7 | 1411 | 04:08 | 1.3 | 47%
5 | Justin.mp4 | 43.2 | 1615 | 41 | 1536 | 03:44 | 1.54 | 5%
6 | Love.mp4 | 67.1 | 2112 | 44.8 | 1411 | 04:26 | 1.98 | 33%
7 | Jojo.avi | 61.8 | 2183 | 39.9 | 1411 | 03:57 | 1.86 | 35%
8 | Baby.mp4 | 43.2 | 1615 | 41 | 1536 | 03:44 | 3.34 | 5%
9 | Never.mp4 | 52.5 | 1657 | 48.5 | 1536 | 04:25 | 2.15 | 8%
10 | Beep.avi | 51.4 | 1628 | 38.4 | 1411 | 03:48 | 01:58 | 25%
Average | | 53.3 | 1950 | 38.7 | 1461 | 03:41 | 1.71 | 24%
COMPARISON BETWEEN THE SIZE OF THE INPUT FILE AND THE EXTRACTED FILE
[Bar chart: size of each input file (.mp4/.avi) before and after extraction, in MB]
From the above graph we can observe that the size of each input file is reduced once the audio has been extracted from the input video. The maximum reduction rate in file size is 48% and the minimum is 5%, giving an average reduction rate of 24%.
COMPARISON BETWEEN THE BITRATE OF THE INPUT FILE AND THE EXTRACTED FILE
[Bar chart: bitrate of each input file (.mp4/.avi) before and after extraction, in kbps]
The bitrates of the input files range from 1615 kbps to 2723 kbps, and the bitrates of the extracted files fall to between 1411 kbps and 1536 kbps, giving an average bitrate of 1461 kbps.
TIME TAKEN FOR EXTRACTION OF INPUT FILE
[Bar chart: time taken for extraction of each input file (.mp4/.avi), in ms]
The time taken to extract each file varies from 0.6 ms to 3.34 ms, with an average extraction time of 1.71 ms.
CONCLUSION
The ASG system aims at automatically generating subtitle text for the input audio/video.
It supports all the MPEG standard formats. The video and subtitles are synchronized. The user can extract audio in any MPEG standard format. Audio of any format can be extracted, but speech recognition is done only in English.
REFERENCES…
[1] B. H. Juang, L. R. Rabiner, "Hidden Markov Models for Speech Recognition", Technometrics, Vol. 33, No. 3, Aug. 1991.
[2] Hong Zhou, Changhui Yu, "Research and design of the audio coding scheme", IEEE International Conference on Multimedia Technology (ICMT), 2011.
[3] Seymour Shlien, "Guide to MPEG-1 Audio Standard", IEEE Transactions on Broadcasting, December 1994.
[4] Justin Burdick, "Building a Regionally Inclusive Dictionary for Speech Recognition", Computer Science and Linguistics, Spring 2004.
[5] Anand Vardhan Bhalla, Shailesh Khaparkar, "Performance Improvement of Speaker Recognition System", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 2, Issue 3, March 2012.
[6] Petr Pollak, Martin Behunek, "Accuracy of MP3 Speech Recognition Under Real-World Conditions", Electrical Engineering, Czech Technical University in Prague, Technická 2.
[7] Yu Li, LingHua Zhang, "Implementation and Research of Streaming Media System and AV Codec Based on Handheld Devices", 12th IEEE International Conference on Communication Technology (ICCT), 2010.
[8] Ibrahim Patel, Y. Srinivas Rao, "Speech Recognition Using HMM with MFCC: An Analysis Using Frequency Spectral Decomposition Technique", Signal & Image Processing: An International Journal (SIPIJ), Vol. 1, No. 2, December 2010.
[9] Jorge Martinez, Hector Perez, Enrique Escamilla, Masahisa Mabo Suzuki, "Speaker Recognition Using Mel Frequency Cepstral Coefficients (MFCC) and Vector Quantization (VQ) Techniques", 22nd International Conference on Electrical Communications and Computers (CONIELECOMP), 2012.
[10] Sadaoki Furui, Li Deng, Mark Gales, Hermann Ney, Keiichi Tokuda, "Fundamental Technologies in Modern Speech Recognition", IEEE Signal Processing Society, November 2012.
[11] Youhao Yu, "Research on Speech Recognition Technology and Its Application", International Conference on Computer Science and Electronics Engineering, 2012.
CONTINUED…
PUBLICATION…
Abhinav Mathur, Tanya Saxena, "Generating Subtitles Automatically using Audio Extraction and Speech Recognition", 7th International Conference on Contemporary Computing (IC3), 2014. (Under review.)
THANK YOU