Core Technologies for Multimodal Interactions (II)

22
Core Technologies for Multimodal Interactions (II) 杜俊 [email protected]

Transcript of Core Technologies for Multimodal Interactions (II)

Page 1: Core Technologies for Multimodal Interactions (II)

Core Technologies for

Multimodal Interactions (II)

杜俊 [email protected]

Page 2: Core Technologies for Multimodal Interactions (II)

Multimodal Interaction Technology (MIT)

• OCR: Optical Character Recognition

• HWR: Handwriting Recognition

• Speech Enhancement

Page 3: Core Technologies for Multimodal Interactions (II)

HWR

• Online Handwritten Character Recognition

• Offline Handwritten Character Recognition

• Online Handwritten Text Recognition

• Offline Handwritten Text Recognition

Page 4: Core Technologies for Multimodal Interactions (II)

Motivation

• Requirement of WP8 HWR in China Markets – Support GB18030 (27533 China Characters) – Robust to rotational distortion

• The Status of HWR Product Team – The original team is reorganized to STC – The new team have no capability to build the engine

Page 5: Core Technologies for Multimodal Interactions (II)

Project: Snake: a WP8 HWR engine

Jun Du, Qiang Huo, Kai Chen “Designing compact classifiers for rotation-free recognition of large

vocabulary online handwritten Chinese characters,” ICASSP, 2012

Page 6: Core Technologies for Multimodal Interactions (II)

Training Stage

Rotation Normalization

Feature Extraction

Training Samples

LBG Clustering

SSM-MCE (Rprop)

Split VQ Compression

Building Fast Match Tree

Run-Time Resources

Page 7: Core Technologies for Multimodal Interactions (II)

Split VQ Compression • Split D-dimensional prototype vectors into Q streams • Sub-vectors in different streams are quantized by VQ with different codebooks

Page 8: Core Technologies for Multimodal Interactions (II)

Fast-Match Tree

Page 9: Core Technologies for Multimodal Interactions (II)

Recognition Stage

Rotation Normalization

Feature Extraction

Input character

Level-1 Recognition Using Fast Match Tree Resources

Level-2 Recognition Using Candidate List Final Results

Page 10: Core Technologies for Multimodal Interactions (II)

A Magic for Rotation Normalization

Starting point of each stroke

Ending point of each stroke

Averaged starting point over all strokes

Averaged ending point over all strokes

Page 11: Core Technologies for Multimodal Interactions (II)

Rotation-Sensitive vs. Rotation-Free

Recognizer #1: rotation-sensitive Recognizer #2: rotation-free

Footprint: 3.4M Footprint: 8.8M

Accuracy (%

)

Accuracy (%

)

Page 12: Core Technologies for Multimodal Interactions (II)

Comparison with Mango (WP7.5) System

Current System

Mango System

Vocabulary

27533 Characters

6763 Characters

Footprint

6MB+

2MB

Accuracy

94.5%

94.2%

Recognition Time

Per Character

7ms

11ms

Page 13: Core Technologies for Multimodal Interactions (II)

Summary

• A Chinese HWR solution for WP8 Apollo release – Rotation-free – Fast – Compact – Accurate – Large vocabulary

Page 14: Core Technologies for Multimodal Interactions (II)

Multimodal Interaction Technology (MIT)

• OCR: Optical Character Recognition

• HWR: Handwriting Recognition

• Speech Enhancement

Page 15: Core Technologies for Multimodal Interactions (II)

Project: Speech Enhancement using

Deep Neural Networks

Yong Xu, Jun Du, Li-Rong Dai, Chin-Hui Lee, “An Experimental Study on Speech Enhancement Based on Deep Neural Networks ,” IEEE Signal Processing Letter, 2014

Page 16: Core Technologies for Multimodal Interactions (II)

Speech Signal

Page 17: Core Technologies for Multimodal Interactions (II)

Speech Enhancement

Utterance-based approach Based on additive noise model Wiener filtering, Spectral subtraction, Log-MMSE Musical noise after enhancement

New perspective: data-driven approach Mapping function between clean and noisy data Learning using deep neural networks (DNNs)

Page 18: Core Technologies for Multimodal Interactions (II)

System Overview

Feature Extraction

Clean/Noisy Samples

DNN Training

Enhancement Stage

Noisy Samples

Feature Extraction

DNN Enhancement

Wave Reconstruction

Enhanced Speech

Training Stage

Page 19: Core Technologies for Multimodal Interactions (II)

DNN Training

Page 20: Core Technologies for Multimodal Interactions (II)

White Noise

Noisy speech

DNN approach Traditional approach

Page 21: Core Technologies for Multimodal Interactions (II)

Machine Gun Noise

Noisy speech

DNN approach Traditional approach

Page 22: Core Technologies for Multimodal Interactions (II)

Speech from Movie <Forrest Gump>

Noisy speech: I remember the bus ride on the first day of school very well

DNN approach Traditional approach