Saras: Shareable Rich Media Learning Object Repositories and Management for e-Learning
Chitra Dorai
IBM T.J. Watson Research Center
New [email protected]
(Saras(wati), a Sanskrit word for flow of knowledge/Goddess of Learning)
Overview of e-Learning Content Management Research
E-learning media semantic analysis for metadata generation
SCORM and MPEG-7 conformant asset metadata model
Search and browse client interfaces
[Figure: system overview. Content types: text, images, course catalogs, student assessments, audio, video. Components: LO ingest, E-Learning Media Analyzer (metadata), SCORM / MPEG-7 Data Model, Content Manager, Asset Repositories, Search & Browse Client, Learning Management System, Learning Authoring Tool.]
[Figure: narrative structures in video. Discussion Sections (DD): dialog, interviews, ... Narration Sections: On-screen narration, split into Direct Narration (DN) and Assistive Narration (AN); Voice Over, split into Uninterrupted Voice Over (UV) and Interrupted Voice Over (IV). Linkage Sections (LF): raw footage, text, ...]
Multimodal narrative structure analysis for partitioning of instructional media
Manage learning assets of various types
Middleware for shareable learning object repositories
Metadata model creation from XML schema
Project Goals
Develop SCORM support technologies
• Enable generic content repositories (CMv8 and DB2) to support standards compliant e-learning and transform into shareable and interoperable learning object repositories
• Analyze instructional media for automated SCORM/MPEG-7 compliant metadata generation
E-Learning and Standards
• The Department of Defense (DoD) established the Advanced Distributed Learning (ADL) initiative in 1997.
• ADL develops strategy for using learning and information technologies to modernize education and training on the Web, and to promote e-learning standardization.
• SCORM (Sharable Content Object Reference Model): ADL reference model for shareable learning content objects that enable interoperability, accessibility and reusability of Web-based learning content.
• Content Aggregation Model: LO Metadata, Content Packaging
• SCORM is built on many e-Learning standardization efforts: AICC, IMS, IEEE LOM (became a standard in 06/02), ARIADNE.
SCORM LOM Overview
• Nine learning object metadata categories from IEEE LOM specification
– General, Lifecycle, Meta-metadata, Technical, Educational, Rights, Relation, Annotation, and Classification
• IMS’s XML binding specification for metadata representation
• Describe three content model components
– Asset, Sharable Content Object (SCO), Content Aggregation
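As a rough illustration of the categories and components listed above, the following Python sketch builds a metadata record and rejects anything outside the nine LOM categories or the three content model components. The function name `make_record`, the keyword naming scheme, and the example values are assumptions made for illustration; this is not the IMS XML binding.

```python
# The nine IEEE LOM category names, as listed on the slide.
LOM_CATEGORIES = (
    "General", "Lifecycle", "Meta-metadata", "Technical", "Educational",
    "Rights", "Relation", "Annotation", "Classification",
)

# The three SCORM content model components, as listed on the slide.
CONTENT_MODEL_COMPONENTS = ("Asset", "SCO", "Content Aggregation")


def make_record(component, **categories):
    """Build a metadata record, rejecting unknown components or categories.

    Keyword names are a hypothetical convention: the category name
    lowercased, with '-' replaced by '_' (e.g. meta_metadata).
    """
    if component not in CONTENT_MODEL_COMPONENTS:
        raise ValueError(f"unknown content model component: {component}")
    allowed = {c.lower().replace("-", "_"): c for c in LOM_CATEGORIES}
    record = {"component": component, "metadata": {}}
    for key, value in categories.items():
        if key not in allowed:
            raise ValueError(f"not a LOM category: {key}")
        record["metadata"][allowed[key]] = value
    return record


# Illustrative values only.
record = make_record(
    "Asset",
    general={"title": "Voice-over segment"},
    technical={"format": "video/mpeg"},
)
```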
Enabling Content Repositories for e-Learning
Objective:
Develop middleware tools to enable content management products (IBM CM v8) and databases (DB2) for standards-based e-Learning archival and for supporting SCORM-compliant learning object metadata.
Creation of SCORM compliant learning object meta-data model on a repository
Automated storage of learning objects and their meta-data in the content repository
Search and retrieval of learning objects based on their meta-data
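The three middleware functions above (a metadata model, storage of objects with their metadata, and metadata-based search) can be sketched with a toy in-memory repository. The class and method names are hypothetical and do not correspond to the IBM Content Manager or DB2 APIs; the flat key/value metadata is a simplifying assumption.

```python
class LearningObjectRepository:
    """Toy in-memory stand-in for a content repository (hypothetical)."""

    def __init__(self):
        self._objects = {}  # object id -> (content, metadata dict)

    def store(self, object_id, content, metadata):
        """Store a learning object together with its metadata."""
        self._objects[object_id] = (content, dict(metadata))

    def search(self, **criteria):
        """Return sorted ids of objects whose metadata matches every criterion."""
        return sorted(
            object_id
            for object_id, (_, metadata) in self._objects.items()
            if all(metadata.get(k) == v for k, v in criteria.items())
        )


repo = LearningObjectRepository()
repo.store("lo-1", b"video bytes", {"format": "video", "language": "en"})
repo.store("lo-2", b"text bytes", {"format": "text", "language": "en"})
print(repo.search(format="video"))  # ['lo-1']
```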
E-Learning Content Management with Content Manager
Meta-data Generation Pages
Automated Instructional Media Analysis
Objectives:
– Develop technologies for standards-based e-learning content tagging, supporting shareable and searchable learning object repositories with rich media.
• Rich instructional media analysis for automated extraction of learning objects and their metadata from media for content-based search and browse
Problem with the State of the Art
“The user seeks semantic similarity, the [multimedia] database can only provide similarity on data processing”
• Existing content annotation/management systems cannot ensure reliable content location and access
– Fall far short of the expectations of users: the semantic gap
– Generic, low-level annotations that deal only with characterizing perceived content, not its meaning
– Lack of structure in content organization for non-linear navigation
Our Approach to Media Semantics Analysis
New Research Approach: Computational Media Aesthetics is the algorithmic study of visual and aural elements in media and associated analysis of the principles that underlie their manipulation in the creative art of clarifying and interpreting some event for an audience.
The best semantic grid for media interpretation is the one within which its creators work: derive meaning from the production grammar and the aesthetic conventions used.
Create tools for understanding high-level semantic constructs in a domain by interpreting the data with its maker’s eye, exploiting media production methods for their perceptual and interpretive guidance.
[Figure: Content Repository, Media Semantic Analyzer, and Metadata, with video partitioned into Discussion Sections (DD): dialog, interviews, ...; Narration Sections: On-screen narration (Direct Narration, DN; Assistive Narration, AN) and Voice Over (Uninterrupted Voice Over, UV; Interrupted Voice Over, IV); and Linkage Sections (LF): raw footage, text, ...]
Example 1 - Multimodal analysis for extracting hierarchy of narrative structures in education/training video
Focus Areas:
• Motion picture analysis for affect and story essence using film grammar (recognized with best paper awards)
• e-learning: multimodal algorithms to parse and structure audiovisual content in media for content distillation & nonlinear browsing
• Multigranular media narrative segmentation to generate & annotate reusable assets
[Figure: Tempo in Titanic. Tempo ebb and flow and associated story elements and events, automatically deconstructed.]
Example 2 - Titanic Movie Analysis for Tempo
Example: Narrative Structure Based Segmentation of Education and Training Videos
Problem Statement: Automatically structure instructional media through high-level semantics-based video partitioning and content tagging for effective segment search, access, and browse services in e-learning content management systems
Joint Work with Dinh Q. Phung and Svetha Venkatesh, Curtin University of Technology, W. Australia
Narrative Structures Hierarchy
• Discussion sections: dialog, interviews, …
• Narration sections
– On-screen Narration: Direct Narration, Assistive Narration
– Voice Over: Un-interrupted VO, Interrupted VO
• Linkage sections: raw footage, text, …
Narrative Structures Hierarchy: Discussion Sections
Capture dialog, interviews, meeting sections.
Narrative Structures Hierarchy: On-Screen Narration
Clear view of a narrator speaking in the scene.
Dominated by narrator’s face and captured in a close-up.
Interrupted presence of the narrator.
Narrative Structures Hierarchy: Voice Overs
The audio track is dominated by the voice of the narrator, but the narrator does not appear (no faces).
• Un-interrupted VO: smooth and continuous
• Interrupted VO: interrupted
Narrative Structures Hierarchy: Linkage Sections
Raw footage, superimposed text, and others.
Visual Processing
• S = {f1, f2, … , fN}: Sequence of frames from shots in a video for face detection
• Detect faces in frames using CMU’s face detector software
Feature 1: How many faces -- “How many frames contain faces, as a proportion of the total frames in a shot?”
Feature 2: Avg. face area -- “If there is a face, how big is the face?”
• Two frame sequences from a shot are used: a uniformly sampled sequence and a key-frame sequence
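The two face features can be sketched as follows. The input format is an assumption: one list of (width, height) face boxes per frame, as some face detector might report them. The slides do not say whether face area is normalized by frame size, so this sketch uses raw pixel area.

```python
def face_features(frames):
    """Compute two per-shot face features.

    `frames` holds one entry per frame; each entry is a list of
    (width, height) face boxes in pixels (assumed detector output).
    Returns (face_ratio, avg_face_area).
    """
    if not frames:
        return 0.0, 0.0
    frames_with_faces = [boxes for boxes in frames if boxes]
    # Feature 1: proportion of frames in the shot that contain a face.
    face_ratio = len(frames_with_faces) / len(frames)
    # Feature 2: average area over all detected faces (raw pixels, an assumption).
    areas = [w * h for boxes in frames_with_faces for (w, h) in boxes]
    avg_face_area = sum(areas) / len(areas) if areas else 0.0
    return face_ratio, avg_face_area


# Four sampled frames; two of them contain faces.
shot = [[(40, 50)], [], [(60, 50), (20, 50)], []]
ratio, area = face_features(shot)
print(ratio, area)  # 0.5 2000.0
```

The same function would be applied to both frame sequences of a shot (uniformly sampled and key frames), giving two values per feature.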
Audio Processing
• Classify shot audio into voice (V), no-voice (N) or mixture of two (M)
• “Is the voice consistently delivered?” New voice connectivity feature: number of contiguous speech-dominant clips normalized by the shot length.
Characterize dominance of speech in audio tracks of shots
• Cluster audio clips into two classes and assume the larger cluster as one of clips with speech domination
• N = total # of audio clips within a shot
Nv = # of clips classified as voice-dominated
Va = voice activity = Nv/N
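A minimal sketch of the audio features, under several assumptions: each clip is summarized by one hypothetical scalar feature, the two-class clustering is done with a simple 1-D k-means, and voice connectivity is read as the longest run of consecutive voice-dominated clips divided by the number of clips in the shot. The slides do not specify the clustering method or the exact connectivity formula, so other readings are possible.

```python
from itertools import groupby


def two_means(values, iterations=20):
    """Split scalar values into two clusters with 1-D k-means (k = 2)."""
    lo, hi = min(values), max(values)
    assign = [0] * len(values)
    for _ in range(iterations):
        assign = [0 if abs(v - lo) <= abs(v - hi) else 1 for v in values]
        group0 = [v for v, a in zip(values, assign) if a == 0]
        group1 = [v for v, a in zip(values, assign) if a == 1]
        if group0:
            lo = sum(group0) / len(group0)
        if group1:
            hi = sum(group1) / len(group1)
    return assign


def voice_flags(clip_features):
    """Mark clips in the larger cluster as voice-dominated (as on the slide)."""
    assign = two_means(clip_features)
    voice_cluster = 1 if sum(assign) * 2 >= len(assign) else 0
    return [a == voice_cluster for a in assign]


def voice_activity(flags):
    """Va = Nv / N."""
    return sum(flags) / len(flags) if flags else 0.0


def voice_connectivity(flags):
    """Longest run of voice-dominated clips over N (one possible reading)."""
    if not flags:
        return 0.0
    runs = [sum(1 for _ in run) for is_voice, run in groupby(flags) if is_voice]
    return max(runs, default=0) / len(flags)


features = [0.9, 0.8, 0.85, 0.1, 0.95, 0.2]  # hypothetical per-clip values
flags = voice_flags(features)
va = voice_activity(flags)      # 4 of 6 clips are voice-dominated
vc = voice_connectivity(flags)  # longest run is 3 clips out of 6
```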
Classification
• Decision Trees as machine learning classifiers for final labeling of narrative structures
• C4.5 algorithm to train and test decision trees
• First learn all six classes at the first children level and test accuracy of labeling
• Propose a two-level decision tree for improved performance
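The two-level idea can be sketched as a routing function: a first classifier separates DD, DN, and AN from a merged group, and a second classifier splits that group. The group label G (UV, IV, LF) follows the results slides later in the deck. The two toy classifiers below are placeholders with arbitrary thresholds and hypothetical feature names; the source trains C4.5 decision trees instead.

```python
GROUP_G = "G"  # merged label for UV, IV and LF, as in the results slides


def classify_shot(features, level1, level2):
    """Label one shot with a two-level classifier.

    level1 returns DD, DN, AN or G; level2 is consulted only for G
    and returns UV, IV or LF.
    """
    label = level1(features)
    return level2(features) if label == GROUP_G else label


# Placeholder rules with arbitrary thresholds, for illustration only.
def toy_level1(features):
    return "DN" if features["face_ratio"] > 0.8 else GROUP_G


def toy_level2(features):
    return "UV" if features["voice_connectivity"] > 0.5 else "IV"


shot = {"face_ratio": 0.1, "voice_connectivity": 0.9}
print(classify_shot(shot, toy_level1, toy_level2))  # UV
```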
Experimental Results: Confusion Matrix for Six Classes
  a   b   c   d   e   f
 10   1   0   1   0   0   a = DD
  0  29   0   3   0   0   b = DN
  0   0  12   2   0   0   c = AN
  0   2   0 480   0   0   d = UV
  2   0   2  22   0   0   e = IV
  0   1   0  13   0   0   f = LF
• Average classification result is high: 91.6%
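The 91.6% figure can be checked from the six-class confusion matrix: correctly labelled shots lie on the diagonal. The sketch below assumes the matrix rows are in the order DD, DN, AN, UV, IV, LF, as the row annotations indicate.

```python
LABELS = ["DD", "DN", "AN", "UV", "IV", "LF"]
MATRIX = [
    [10, 1, 0, 1, 0, 0],    # DD
    [0, 29, 0, 3, 0, 0],    # DN
    [0, 0, 12, 2, 0, 0],    # AN
    [0, 2, 0, 480, 0, 0],   # UV
    [2, 0, 2, 22, 0, 0],    # IV
    [0, 1, 0, 13, 0, 0],    # LF
]


def accuracy(matrix):
    """Fraction of instances on the diagonal of a confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


print(f"{accuracy(MATRIX):.1%}")  # 91.6%  (531 of 580 shots)
```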
Exp. Results (cont.)
• Results are very good for classes DD, DN, AN and UV, but poor for classes IV and LF
• Voice over with many faces present (meetings, parties, ...) accounts for most of the misclassification
• Solution: group IV, LF and UV into a group G and study it separately
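The per-class picture can be made concrete by computing, for each class, the share of its shots that were labelled correctly. This sketch assumes that rows of the six-class confusion matrix are the true classes and columns the assigned classes, which is the usual layout for output with "a = DD" row annotations but is not stated on the slides.

```python
LABELS = ["DD", "DN", "AN", "UV", "IV", "LF"]
MATRIX = [
    [10, 1, 0, 1, 0, 0],
    [0, 29, 0, 3, 0, 0],
    [0, 0, 12, 2, 0, 0],
    [0, 2, 0, 480, 0, 0],
    [2, 0, 2, 22, 0, 0],
    [0, 1, 0, 13, 0, 0],
]


def per_class_recall(labels, matrix):
    """Diagonal entry divided by row total, per class (rows assumed true)."""
    return {
        label: row[i] / sum(row)
        for i, (label, row) in enumerate(zip(labels, matrix))
    }


recall = per_class_recall(LABELS, MATRIX)
for label in LABELS:
    print(f"{label}: {recall[label]:.1%}")
```

Under that assumption, none of the IV or LF shots are labelled correctly, and most of them (22 of 26 and 13 of 14) are assigned to UV.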
Exp. Results (cont.)
Confusion matrix with UV, IV and LF merged into group G:
  a   b   c   G
 10   1   0   1   a = DD
  0  29   0   3   b = DN
  0   0  12   2   c = AN
  2   3   2 515   G
Accuracy: 97.6%
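Merging UV, IV, and LF into one group amounts to summing the corresponding rows and columns of the six-class matrix. The sketch below does that and recomputes the accuracy; the helper name `merge_classes` is an assumption for illustration.

```python
LABELS = ["DD", "DN", "AN", "UV", "IV", "LF"]
MATRIX = [
    [10, 1, 0, 1, 0, 0],
    [0, 29, 0, 3, 0, 0],
    [0, 0, 12, 2, 0, 0],
    [0, 2, 0, 480, 0, 0],
    [2, 0, 2, 22, 0, 0],
    [0, 1, 0, 13, 0, 0],
]


def merge_classes(labels, matrix, members, merged_label):
    """Collapse the classes in `members` into a single row and column."""
    new_labels = [l for l in labels if l not in members] + [merged_label]
    index = {l: i for i, l in enumerate(new_labels)}

    def target(label):
        return index[merged_label] if label in members else index[label]

    n = len(new_labels)
    merged = [[0] * n for _ in range(n)]
    for i, row_label in enumerate(labels):
        for j, col_label in enumerate(labels):
            merged[target(row_label)][target(col_label)] += matrix[i][j]
    return new_labels, merged


new_labels, merged = merge_classes(LABELS, MATRIX, {"UV", "IV", "LF"}, "G")
correct = sum(merged[i][i] for i in range(len(merged)))
total = sum(sum(row) for row in merged)
print(new_labels)                # ['DD', 'DN', 'AN', 'G']
print(merged[-1])                # [2, 3, 2, 515]
print(f"{correct / total:.1%}")  # 97.6%  (566 of 580 shots)
```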
Exp. Results (cont.)
• Over-fitting is the problem identified in G, due to UV instances outnumbering IV and LF
• To mitigate the problem, reduce the number of UV instances so that the numbers of IV, UV and LF instances are approximately the same, and train with C4.5
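Reducing the majority class can be sketched as random undersampling. The slides do not say how UV instances were selected, so the random choice, the fixed seed, and the target size of 26 are assumptions. The class counts used in the example (482 UV, 26 IV, 14 LF) are the row totals of the three-class matrix in the source, assuming rows are true classes.

```python
import random
from collections import Counter


def undersample(instances, labels, majority, target_size, seed=0):
    """Keep at most `target_size` randomly chosen instances of `majority`."""
    rng = random.Random(seed)
    majority_idx = [i for i, l in enumerate(labels) if l == majority]
    keep = set(rng.sample(majority_idx, min(target_size, len(majority_idx))))
    kept = [i for i, l in enumerate(labels) if l != majority or i in keep]
    return [instances[i] for i in kept], [labels[i] for i in kept]


labels = ["UV"] * 482 + ["IV"] * 26 + ["LF"] * 14
instances = list(range(len(labels)))  # stand-ins for feature vectors
balanced_x, balanced_y = undersample(instances, labels, "UV", target_size=26)
print(Counter(balanced_y))
```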
   a   b   c
 424  40  18   a = UV
  14  10   2   b = IV
   7   1   6   c = LF
Accuracy: 84.3%
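The 84.3% figure follows from the diagonal of the three-class matrix. One way to also reproduce the 84.7% overall accuracy quoted in the conclusion is to add the correctly labelled DD, DN, and AN shots from the first level to the correctly labelled UV, IV, and LF shots from the second level, and divide by all 580 shots. The source does not state how the overall figure was computed, so this combination is an assumption that happens to match; `LEVEL1` is the grouped matrix as reconstructed from the slide.

```python
LEVEL1 = [  # classes DD, DN, AN, G
    [10, 1, 0, 1],
    [0, 29, 0, 3],
    [0, 0, 12, 2],
    [2, 3, 2, 515],
]
LEVEL2 = [  # classes UV, IV, LF
    [424, 40, 18],
    [14, 10, 2],
    [7, 1, 6],
]


def diagonal(matrix):
    """Number of correctly labelled instances."""
    return sum(matrix[i][i] for i in range(len(matrix)))


def total(matrix):
    """Number of instances in the matrix."""
    return sum(sum(row) for row in matrix)


level2_accuracy = diagonal(LEVEL2) / total(LEVEL2)            # 440 / 522
non_g_correct = diagonal(LEVEL1) - LEVEL1[-1][-1]             # 10 + 29 + 12
overall = (non_g_correct + diagonal(LEVEL2)) / total(LEVEL1)  # 491 / 580

print(f"{level2_accuracy:.1%}")  # 84.3%
print(f"{overall:.1%}")          # 84.7%
```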
Conclusion
• Novel narrative structure based analysis for segmentation of education and training videos
• Hierarchical DT-classification system achieves an overall accuracy of 84.7%
• Focus on higher level semantics such as segmentation of topics
• Work is underway
– Map media objects to LOs
– Algorithms for support of both SCORM and MPEG-7 compliant XML metadata
Acknowledgements
Team:
• Geetika Tewari (IBM TJW, currently at Harvard U)
• Norman Haas (IBM TJW)
• Austin Schilling (IBM SWG)