Automatic subtitle generation
MAJOR PROJECT
Supervisor: K. Rajalakshmi
Submitted by: Tanya Saxena (10503894), Abhinav Mathur (10503858)
Automatic Subtitle Generation from Videos
Video has become one of the most popular multimedia artefacts used on PCs and the Internet. In a majority of videos, the sound holds an important place. It is therefore essential to make sound videos understandable for people with auditory problems, as well as for people with gaps in the spoken language. The most natural way to do this is through subtitles.
However, manual subtitle creation is a long and tedious activity that requires the constant presence of the user. Consequently, automatic subtitle generation is a valid subject of research.
PROBLEM STATEMENT...
The system should take a video file as input and generate a subtitle file (.srt/.txt) as output. The three modules are:
Audio Extraction:
The audio extraction routine is expected to return a suitable audio format that can be used by the speech recognition module as pertinent material. It must handle a defined list of video and audio formats, and it has to verify the input file so that it can evaluate the feasibility of extraction. The audio track has to be returned in the most reliable format.
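A minimal sketch of the input-verification step described above; the supported-format list follows the functional requirements later in this document, and the class and extension list are illustrative assumptions, not the project's actual code:

```java
import java.util.Set;

public class ExtractionCheck {
    // Formats the extractor accepts (illustrative list, per the
    // MPEG-standard requirement stated in this document).
    static final Set<String> SUPPORTED =
            Set.of("mp2", "mp3", "mp4", "avi", "wav", "au", "aac", "flac");

    // Verify the input file so extraction feasibility can be evaluated.
    static boolean canExtract(String filename) {
        int dot = filename.lastIndexOf('.');
        if (dot < 0) return false;          // no extension at all
        String ext = filename.substring(dot + 1).toLowerCase();
        return SUPPORTED.contains(ext);
    }

    public static void main(String[] args) {
        System.out.println(canExtract("File.mp4")); // true
        System.out.println(canExtract("File.wma")); // false: not MPEG-standard
    }
}
```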
Speech Recognition:
The speech recognition routine is the key part of the system: it directly affects performance and the evaluation of results. First, it must determine the type of the input file; if the type is provided, an appropriate processing method is chosen, otherwise the routine uses a default configuration. It must be able to recognize silences so that text delimitations can be established.
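The silence-recognition step can be sketched as a simple energy threshold over frames of samples. The threshold value and frame contents below are illustrative assumptions; real recognizers use more robust endpointing:

```java
public class SilenceDetector {
    // A frame counts as silence when its mean absolute amplitude falls
    // below a fixed threshold (illustrative value, samples in [-1, 1]).
    static final double THRESHOLD = 0.01;

    static boolean isSilent(double[] frame) {
        double sum = 0;
        for (double s : frame) sum += Math.abs(s);
        return (sum / frame.length) < THRESHOLD;
    }

    public static void main(String[] args) {
        double[] quiet = {0.001, -0.002, 0.0005, 0.0};
        double[] speech = {0.2, -0.3, 0.25, -0.1};
        System.out.println(isSilent(quiet));  // true
        System.out.println(isSilent(speech)); // false
    }
}
```

Runs of silent frames mark the boundaries between utterances, which become the text delimitations mentioned above.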
Subtitle Generation:
The subtitle generation routine creates and writes a file containing multiple chunks of text, corresponding to utterances delimited by silences, together with their respective start and end times. Time synchronization considerations are of main importance.
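One entry of the output file can be sketched as below; the timecode layout (HH:MM:SS,mmm) is the standard SRT format, while the class and method names are illustrative:

```java
public class SrtWriter {
    // Format milliseconds as the SRT timecode HH:MM:SS,mmm.
    static String timecode(long ms) {
        long h = ms / 3_600_000, m = (ms / 60_000) % 60,
             s = (ms / 1000) % 60, frac = ms % 1000;
        return String.format("%02d:%02d:%02d,%03d", h, m, s, frac);
    }

    // One numbered SRT cue: index, time range, text, then a blank line.
    static String cue(int index, long startMs, long endMs, String text) {
        return index + "\n" + timecode(startMs) + " --> " + timecode(endMs)
                + "\n" + text + "\n\n";
    }

    public static void main(String[] args) {
        // An utterance recognized between 0.5 s and 2.75 s:
        System.out.print(cue(1, 500, 2750, "Hello world"));
    }
}
```

Each utterance delimited by silence becomes one such cue, which is why accurate start/end times from the recognizer are critical for synchronization.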
BENEFITS OF USING SUBTITLES....
The major benefit is that viewers do not need to download subtitles from the Internet in order to watch a video with subtitles.
Captions help children with word identification, meaning, acquisition, and retention.
Captions can help children establish a systematic link between the written word and the spoken word.
Captioning has been linked to higher comprehension skills compared with watching the same media without captions.
Captions provide missing information for individuals who have difficulty processing the speech and auditory components of visual media (whether or not this difficulty is due to hearing loss).
Captioning is essential for children who are deaf and hard of hearing, can be very beneficial to those learning English as a second language, can help those with reading and literacy problems, and can help those who are learning to read.
OVERALL ARCHITECTURE
OFTHE PROJECT
#1
FLOW DIAGRAM
#2
USE CASE DIAGRAM
#3
ACTIVITY DIAGRAM
AUDIO EXTRACTION…
SPEECH RECOGNITION…
SUBTITLE GENERATION…
TECHNOLOGY &
TOOLS USED
FFMPEG…
The FFMPEG libraries are used to carry out most of our multimedia tasks quickly and easily: audio compression, audio/video format conversion, extracting images from a video, and a lot more. They can be used by developers for transcoding, streaming and playing.
It is a very stable framework for transcoding video and audio.
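As a sketch of how FFmpeg is typically driven for this project's extraction step, the snippet below assembles the command line that strips the video stream and encodes the audio as MP3. The file names are illustrative, and invoking the external `ffmpeg` binary this way is an assumption about the integration, not the project's confirmed code:

```java
import java.util.List;

public class AudioExtract {
    // Build an ffmpeg command that extracts the audio track from a
    // video file and re-encodes it as MP3.
    static List<String> buildCommand(String videoIn, String audioOut) {
        return List.of(
            "ffmpeg",
            "-i", videoIn,              // input video file
            "-vn",                      // drop the video stream
            "-acodec", "libmp3lame",    // encode the audio as MP3
            audioOut);
    }

    public static void main(String[] args) {
        // The command could be run with ProcessBuilder; here we just print it.
        System.out.println(String.join(" ", buildCommand("File.mp4", "File.mp3")));
    }
}
```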
JAVA SPEECH API…
It allows developers to incorporate speech technology into user interfaces for their Java applets and applications. The API specifies a cross-platform interface to support command-and-control recognizers, dictation systems and speech synthesizers. Sun has also developed JSGF (the Java Speech Grammar Format) to provide a cross-platform grammar format for speech recognizers.
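As an illustration of the format, a minimal JSGF grammar has this shape (the grammar below is a made-up example, not taken from the project):

```jsgf
#JSGF V1.0;
grammar commands;
public <command> = play | pause | stop;
```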
CURRENT PROBLEMS…
Robustness.
Automatic generation of word lexicons.
Finding the theoretical limit for FSM implementations of ASR systems.
Optimal utterance verification-rejection algorithms.
Accuracy and Word Error Rate.
Filling up missing offset samples with silence.
Synchronization between tracks.
FUNCTIONAL REQUIREMENTS
All MPEG standard formats, such as MP2, MP3 etc., are supported for audio/video.
Audio of any format can be extracted, but speech recognition is done only in English.
The text extracted from the audio/video is in the .srt format, and the displayed text has a readable format.
Captions appear on-screen long enough to be read; it is preferable to limit on-screen captions to no more than two lines. Captions are synchronized with the spoken words.
The user can convert the extracted audio into any suitable format supported under the MPEG standards.
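The two-line caption requirement above can be sketched as a greedy word-wrap that rejects captions needing more than two lines (the line width and class name are illustrative assumptions):

```java
import java.util.ArrayList;
import java.util.List;

public class CaptionWrap {
    static final int MAX_CHARS_PER_LINE = 32; // illustrative width

    // Greedily wrap words onto lines; return null when the caption would
    // need more than two lines and should be split across cues instead.
    static List<String> wrapToTwoLines(String caption) {
        List<String> lines = new ArrayList<>();
        StringBuilder line = new StringBuilder();
        for (String word : caption.split("\\s+")) {
            if (line.length() > 0
                    && line.length() + 1 + word.length() > MAX_CHARS_PER_LINE) {
                lines.add(line.toString());
                line = new StringBuilder();
            }
            if (line.length() > 0) line.append(' ');
            line.append(word);
        }
        if (line.length() > 0) lines.add(line.toString());
        return lines.size() <= 2 ? lines : null;
    }

    public static void main(String[] args) {
        System.out.println(wrapToTwoLines("the quick brown fox jumps over the lazy dog"));
    }
}
```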
NON-FUNCTIONAL
REQUIREMENTS
System Requirements – The software is compatible on all the Operating Systems. The user needs to install the .exe file of the software in their PCs.
Security – The system has no security constraints. Performance – The text is synchronized with the song. Maintainability – The software is easy to maintain. Reliability - The software will provide a good level of
precision. Modifiability- The software cannot be modified by
external user. Scalability- The software is scalable as a number of
users can utilize it for their benefits simultaneously.
PROPOSED ALGORITHMS
MP3 ALGORITHM…
1. Initialize i = 0, j = 1.
2. tincr = 1.0 / sample_rate.
3. dstp = dst; c = 2 * M_PI * 440.0.
4. Generate a sine tone with a 440 Hz frequency and duplicated channels.
5. Check if i < nb_samples; if true, generate the sine wave and store it: *dstp = sin(c * *t).
6. Check if j < nb_channels.
7. Store the packets in the destination buffer.
8. Increment dstp += nb_channels and t += tincr.
9. Repeat until the dst buffer is filled with nb_samples, generated starting from t.
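The steps above can be sketched in Java as follows. The variable names mirror the pseudocode (nb_samples, nb_channels, tincr), while the class itself and the interleaved buffer layout are illustrative assumptions:

```java
public class SineTone {
    // Fill an interleaved buffer with a 440 Hz sine wave, duplicating
    // each sample across all channels, as in the steps above.
    static double[] generate(int nbSamples, int nbChannels, int sampleRate) {
        double[] dst = new double[nbSamples * nbChannels];
        double tincr = 1.0 / sampleRate;    // time step per sample
        double c = 2 * Math.PI * 440.0;     // angular frequency of 440 Hz
        double t = 0;
        int dstp = 0;
        for (int i = 0; i < nbSamples; i++) {
            double v = Math.sin(c * t);
            for (int j = 0; j < nbChannels; j++) {
                dst[dstp + j] = v;          // duplicate across channels
            }
            dstp += nbChannels;
            t += tincr;
        }
        return dst;
    }

    public static void main(String[] args) {
        double[] buf = generate(4, 2, 44100);
        System.out.println(buf.length); // 8: four samples, two channels each
    }
}
```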
MFCC (MEL FREQUENCY CEPSTRAL COEFFICIENT)
Check whether the delta frequency (the ratio between the sample rate and the number of FFT points) is zero:

    if (deltaFreq == 0) {
        throw new IllegalArgumentException("deltaFreq has zero value");
    }

Check whether the left and right boundaries of the filter are too close:

    if ((Math.round(rightEdge - leftEdge) == 0)
            || (Math.round(centerFreq - leftEdge) == 0)
            || (Math.round(rightEdge - centerFreq) == 0)) {
        throw new IllegalArgumentException("Filter boundaries too close");
    }

Find how many frequency bins fit in the current frequency range, and initialize the weight field:

    numberElementsWeightField =
            (int) Math.round((rightEdge - leftEdge) / deltaFreq + 1);
    if (numberElementsWeightField == 0) {
        throw new IllegalArgumentException("Number of elements in mel is zero.");
    }
    weight = new double[numberElementsWeightField];
CONTINUED…
Compute the filter height and the slopes based on it:

    filterHeight = 2.0f / (rightEdge - leftEdge);
    leftSlope = filterHeight / (centerFreq - leftEdge);
    rightSlope = filterHeight / (centerFreq - rightEdge);

Now compute the weight for each frequency bin:

    for (currentFreq = initialFreq, indexFilterWeight = 0;
            currentFreq <= rightEdge;
            currentFreq += deltaFreq, indexFilterWeight++) {
        if (currentFreq < centerFreq) {
            weight[indexFilterWeight] = leftSlope * (currentFreq - leftEdge);
        } else {
            weight[indexFilterWeight] =
                    filterHeight + rightSlope * (currentFreq - centerFreq);
        }
    }

Convert linear frequency to mel frequency:

    private double linToMelFreq(double inputFreq) {
        return (2595.0 * (Math.log(1.0 + inputFreq / 700.0) / Math.log(10.0)));
    }
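The fragments above can be assembled into a self-contained sketch of one triangular mel filter. The class name and the example edge frequencies are illustrative; the formulas follow the snippets:

```java
public class MelFilterSketch {
    // Linear-to-mel conversion, as in the snippet above.
    static double linToMelFreq(double f) {
        return 2595.0 * (Math.log(1.0 + f / 700.0) / Math.log(10.0));
    }

    // Triangular filter weights: a rising slope from leftEdge up to
    // centerFreq, then a falling slope down to rightEdge (all in Hz).
    static double[] weights(double leftEdge, double centerFreq,
                            double rightEdge, double deltaFreq) {
        int n = (int) Math.round((rightEdge - leftEdge) / deltaFreq + 1);
        double filterHeight = 2.0 / (rightEdge - leftEdge);
        double leftSlope = filterHeight / (centerFreq - leftEdge);
        double rightSlope = filterHeight / (centerFreq - rightEdge);
        double[] w = new double[n];
        double f = leftEdge;
        for (int i = 0; i < n; i++, f += deltaFreq) {
            w[i] = (f < centerFreq)
                    ? leftSlope * (f - leftEdge)
                    : filterHeight + rightSlope * (f - centerFreq);
        }
        return w;
    }

    public static void main(String[] args) {
        // A filter from 100 Hz to 300 Hz centered at 200 Hz, 50 Hz bins:
        for (double v : weights(100, 200, 300, 50)) {
            System.out.printf("%.3f ", v); // 0.000 0.005 0.010 0.005 0.000
        }
    }
}
```

The weight is zero at both edges and peaks at the center frequency, which is the expected triangular shape.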
IMPLEMENTATION
#1
AUDIO EXTRACTION
#2
SPEECH RECOGNITION
#3
SUBTITLE GENERATION
RISK AND ITS IMPACT
Risk ID | Classification | Description of Risk | Risk Area | Probability | Impact | RE (P*I)
1 | Product Engineering | Word Error Rate | Performance | L | H | M
2 | Product Engineering | Aliasing | Performance | M | M | M
3 | Development Environment | Bitrate of extracted audio more than that of input audio | Testing Environment | L | L | L
4 | Product Engineering | Accuracy and Speed | Performance | L | H | M
5 | Program Constraint | Format not recognized | External Input | L | H | M
RISK AND MITIGATION
PLANS
Risk ID | Description of Risk | Risk Area | Mitigation
1 | Word Error Rate | Performance | Having an efficient database (training set).
2 | Aliasing | Performance | Resampling the samples at a fixed frequency.
3 | Bitrate of extracted audio more than that of input audio | Testing Environment | Encode and decode audio at the bitrate of the input audio.
4 | Accuracy and Speed | Performance | Synchronization.
5 | Format not recognized | External Input | Input audio/video supported by MPEG standard formats.
BLACK BOX TESTING
Test Case ID | Input | Expected Output | Status
1.1 | File.mp3 | File.mp3 | Pass
1.2 | File.mp4 | File.mp3 | Pass
1.3 | File.mp2 | File.mp3 | Pass
1.4 | File.au | File.au | Pass
1.5 | File.aac | File.aac | Pass
1.6 | File.wav | File.wav | Pass
1.7 | File.flac | File.flac | Pass
1.8 | File.wma (format not supported by MPEG standards) | File.wma | Fail
1.9 | File.als (format not supported by MPEG standards) | File.als | Fail
2.1 | File.wav (words present in the dictionary) | Speech recognized, text printed | Pass
2.2 | File.mp3 (not a .wav file) | Speech recognized, text printed | Fail
2.3 | File.au (not a .wav file) | Speech recognized, text printed | Fail
2.4 | File.flac (not a .wav file) | Speech recognized, text printed | Fail
2.5 | File.wav (words not found in the dictionary) | Speech recognized, text printed | Fail
3.1 | File.srt (incorrect timecode) | Subtitles generated and synchronized with the video | Fail
3.2 | File.srt (correct timecode), File.avi | Subtitles generated and synchronized with the video file File.avi | Pass
3.3 | File.txt (not containing the timecode) | Subtitles generated and synchronized with the video | Fail
3.4 | File.srt (correct timecode), File.mp4 | Subtitles generated and synchronized with the video file File.mp4 | Pass
3.5 | File.srt (correct timecode), File.wma | Subtitles generated and synchronized with the video file File.wma | Pass
WHITE BOX TESTING
AUDIO EXTRACTION…
CC = E - N + 2, where E = number of edges (80) and N = number of nodes (72).
CC = 80 - 72 + 2 = 10
CYCLOMATIC COMPLEXITY…
SPEECH RECOGNITION…
CC = E - N + 2, where E = number of edges (98) and N = number of nodes (91).
CC = 98 - 91 + 2 = 9
CYCLOMATIC COMPLEXITY…
ERROR & EXCEPTION
HANDLING
Test Case ID | Component | Debugging Technique
1.8 | Audio Extraction | Backtracking debugging
1.9 | Audio Extraction | Backtracking debugging
2.2 | Speech Recognition | Backtracking debugging
2.3 | Speech Recognition | Backtracking debugging
2.4 | Speech Recognition | Backtracking debugging
2.5 | Speech Recognition | Print debugging
3.1 | Subtitles Generation | Print debugging
3.3 | Subtitles Generation | Backtracking debugging
Test Case ID | Input | Expected Output | Status
1.8 | File.au (format supported by MPEG standards) | File.au | Pass
1.9 | File.mp4 (format supported by MPEG standards) | File.mp3 | Pass
2.2 | File.wav | Speech recognized, text printed | Pass
2.3 | File.wav | Speech recognized, text printed | Pass
2.4 | File.wav | Speech recognized, text printed | Pass
2.5 | File.wav (words found in the dictionary) | Speech recognized, text printed | Pass
3.1 | File.srt (correct timecode) | Subtitles generated and synchronized with the video | Pass
3.3 | File.srt | Subtitles generated and synchronized with the video | Pass
RESEARCH WORK
DETAILED STUDY OF INPUT AND EXTRACTED FILES…
S. No. | Input File | Size Before (MB) | Bitrate Before (kbps) | Size After (MB) | Bitrate After (kbps) | Length of File (min:sec) | Time for Extraction (ms) | Reduction Rate
1 | Despicable.avi | 10.8 | 1628 | 8.24 | 1411 | 00:49 | 0.6 | 24%
2 | Time.mp4 | 48.1 | 1663 | 44.4 | 1536 | 04:02 | 3.12 | 8%
3 | Florida.mp4 | 76 | 2723 | 39.3 | 1411 | 03:54 | 1.08 | 48%
4 | International.mp4 | 79.1 | 2673 | 41.7 | 1411 | 04:08 | 1.3 | 47%
5 | Justin.mp4 | 43.2 | 1615 | 41 | 1536 | 03:44 | 1.54 | 5%
6 | Love.mp4 | 67.1 | 2112 | 44.8 | 1411 | 04:26 | 1.98 | 33%
7 | Jojo.avi | 61.8 | 2183 | 39.9 | 1411 | 03:57 | 1.86 | 35%
8 | Baby.mp4 | 43.2 | 1615 | 41 | 1536 | 03:44 | 3.34 | 5%
9 | Never.mp4 | 52.5 | 1657 | 48.5 | 1536 | 04:25 | 2.15 | 8%
10 | Beep.avi | 51.4 | 1628 | 38.4 | 1411 | 03:48 | 01:58 | 25%
Average | | 53.3 | 1950 | 38.7 | 1461 | 03:41 | 1.71 | 24%
COMPARISON BETWEEN THE SIZE OF THE INPUT FILE AND THE EXTRACTED FILE
[Bar chart: size of each input file (.mp4/.avi) before and after extraction, in MB]
From the above graph we can observe that the size of each input file is reduced once the audio has been extracted from the input video. The maximum reduction rate in file size is 48% and the minimum is 5%, giving an average reduction rate of 24%.
COMPARISON BETWEEN THE BITRATE OF THE INPUT FILE AND THE EXTRACTED FILE
[Bar chart: bitrate of each input file (.mp4/.avi) before and after extraction, in kbps]
The bitrates of the input files range from 1615 kbps to 2723 kbps, and the bitrates of the extracted files fall to between 1411 kbps and 1536 kbps, giving an average bitrate of 1461 kbps.
TIME TAKEN FOR EXTRACTION OF INPUT FILE
[Bar chart: time taken for extraction of each input file (.mp4/.avi), in ms]
The time taken to extract each file varies from 0.6 ms to 3.34 ms, with an average extraction time of 1.71 ms.
CONCLUSION
The ASG system aims at automatically generating subtitle text for the input audio/video.
It supports all the MPEG standard formats. The video and subtitles are synchronized. The user can extract audio in any MPEG standard format. Audio of any format can be extracted, but speech recognition is done only in English.
REFERENCES…
[1] B. H. Juang, L. R. Rabiner, "Hidden Markov Models for Speech Recognition", Technometrics, Vol. 33, No. 3, Aug. 1991.
[2] Hong Zhou, Changhui Yu, "Research and design of the audio coding scheme", IEEE International Conference on Multimedia Technology (ICMT), 2011.
[3] Seymour Shlien, "Guide to MPEG-1 Audio Standard", IEEE Transactions on Broadcasting, December 1994.
[4] Justin Burdick, "Building a Regionally Inclusive Dictionary for Speech Recognition", Computer Science and Linguistics, Spring 2004.
[5] Anand Vardhan Bhalla, Shailesh Khaparkar, "Performance Improvement of Speaker Recognition System", International Journal of Advanced Research in Computer Science and Software Engineering, Volume 2, Issue 3, March 2012.
[6] Petr Pollak, Martin Behunek, "Accuracy of MP3 Speech Recognition Under Real-World Conditions", Electrical Engineering, Czech Technical University in Prague, Technická 2.
[7] Yu Li, LingHua Zhang, "Implementation and Research of Streaming Media System and AV Codec Based on Handheld Devices", 12th IEEE International Conference on Communication Technology (ICCT), 2010.
[8] Ibrahim Patel, Y. Srinivas Rao, "Speech Recognition Using HMM with MFCC: An Analysis Using Frequency Spectral Decomposition Technique", Signal & Image Processing: An International Journal (SIPIJ), Vol. 1, No. 2, December 2010.
[9] Jorge Martinez, Hector Perez, Enrique Escamilla, Masahisa Mabo Suzuki, "Speaker Recognition Using Mel Frequency Cepstral Coefficients (MFCC) and Vector Quantization (VQ) Techniques", 22nd International Conference on Electrical Communications and Computers (CONIELECOMP), 2012.
[10] Sadaoki Furui, Li Deng, Mark Gales, Hermann Ney, Keiichi Tokuda, "Fundamental Technologies in Modern Speech Recognition", IEEE Signal Processing Society, November 2012.
[11] Youhao Yu, "Research on Speech Recognition Technology and Its Application", International Conference on Computer Science and Electronics Engineering, 2012.
CONTINUED…
PUBLICATION…
Abhinav Mathur, Tanya Saxena, "Generating Subtitles Automatically using Audio Extraction and Speech Recognition", 7th International Conference on Contemporary Computing (IC3), 2014. (Under review.)
THANK YOU