SpeeG - A Multimodal Speech- and Gesture-based Text Input Solution

Post on 07-Nov-2014


Presentation given at AVI 2012, International Working Conference on Advanced Visual Interfaces, Capri Island, Italy, May 2012

ABSTRACT: We present SpeeG, a multimodal speech- and body gesture-based text input system targeting media centres, set-top boxes and game consoles. Our controller-free zoomable user interface combines speech input with a gesture-based real-time correction of the recognised voice input. While the open source CMU Sphinx voice recogniser transforms speech input into written text, Microsoft’s Kinect sensor is used for the hand gesture tracking. A modified version of the zoomable Dasher interface combines the input from Sphinx and the Kinect sensor. In contrast to existing speech error correction solutions with a clear distinction between a detection and correction phase, our innovative SpeeG text input system enables continuous real-time error correction. An evaluation of the SpeeG prototype has revealed that low error rates for a text input speed of about six words per minute can be achieved after a minimal learning phase. Moreover, in a user study SpeeG has been perceived as the fastest of all evaluated user interfaces and therefore represents a promising candidate for future controller-free text input.

Paper: http://vub.academia.edu/BeatSigner/Papers/1484787/SpeeG_A_Multimodal_Speech-_and_Gesture-based_Text_Input_Solution

Transcript of SpeeG - A Multimodal Speech- and Gesture-based Text Input Solution

SpeeG - A Multimodal Speech- and Gesture-based Text Input Solution

Lode Hoste, Bruno Dumas and Beat Signer

SpeeG - Lode Hoste, Vrije Universiteit Brussel

Text-input for set-top boxes


Dasher, 8Pen, SwiftKey, Speech Dasher, SpeeG, EdgeWriter, 1D Keyboard for Kinect, Virtual Keyboard for Xbox, Chatpad Controller

Virtual keyboard

Kinect 1D keyboard


Dasher

- Continuous input
- Joystick / Gaze / ...
- Open vocabulary
- Allows imprecise navigation
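Dasher's zoomable layout places the candidate next characters in alphabetic order and gives each one a share of the screen proportional to its probability, which is why imprecise pointing still tends to land on a likely character. The sketch below illustrates that idea only; it is not JDasher's code, the function names `dasher_layout` and `select` are hypothetical, and the probability values are made up for the example.

```python
from typing import Dict, List, Tuple

Interval = Tuple[str, float, float]  # (symbol, start, end) on the unit interval


def dasher_layout(probabilities: Dict[str, float]) -> List[Interval]:
    """Split [0, 1] into one box per symbol, in alphabetic order,
    with each box's height proportional to the symbol's probability."""
    total = sum(probabilities.values())
    if total <= 0:
        raise ValueError("probabilities must sum to a positive value")
    intervals: List[Interval] = []
    start = 0.0
    for symbol in sorted(probabilities):  # alphabetic order
        end = start + probabilities[symbol] / total
        intervals.append((symbol, start, end))
        start = end
    return intervals


def select(intervals: List[Interval], y: float) -> str:
    """Return the symbol whose box contains the pointer position y in [0, 1]."""
    for symbol, start, end in intervals:
        if start <= y < end:
            return symbol
    return intervals[-1][0]  # y == 1.0 falls into the last box


# Hypothetical probabilities for the next character.
layout = dasher_layout({"t": 0.5, "a": 0.25, "s": 0.25})
print(layout)               # boxes for a, s, t in that order
print(select(layout, 0.6))  # pointer in the lower half selects "t"
```

A likely character gets a large box, so a coarse hand position is enough to pick it, while unlikely characters remain reachable with more precise navigation.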


Goals:

- Controller-free
- Text input
- Without training

Used technologies:

- Kinect
- CMU Sphinx
- Dasher

SpeeG

SpeeG Architecture

Components: User, GUI (JDasher), Speech Recogniser (CMU Sphinx 4), Hand Tracking (Microsoft Kinect and NITE). The diagram numbers the interactions between them 1 to 5.

Evaluation

Input methods compared: SpeeG, Speech-only, Virtual Keyboard, Kinect Keyboard

7 (male) users: 23-31y

Performed a quantitative (words per minute and number of errors) and qualitative (feedback and preference) evaluation

Sentences 1-3: DARPA’s TIMIT

- “this was easy for us”
- “he will allow a rare lie”
- “did you eat yet”

Sentences 4-6: MacKenzie and Soukoreff

- “my watch fell in the water”
- “the world is a stage”
- “peek out the window”

show 2 about ‘expertise of users’
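The two quantitative measures named above can be sketched as follows. The slides do not say how WPM or errors were computed, so this is an assumption: WPM uses the common text-entry convention of five characters per word, and errors are counted as the word-level edit distance between the target sentence and the entered text. The function names and the 40-second timing are hypothetical.

```python
def words_per_minute(entered: str, seconds: float, chars_per_word: int = 5) -> float:
    """WPM under the five-characters-per-word convention (an assumption here):
    (len(entered) - 1) characters over the elapsed time, scaled to one minute."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return (len(entered) - 1) / seconds * 60 / chars_per_word


def word_errors(target: str, entered: str) -> int:
    """Word-level edit distance: insertions, deletions and substitutions
    needed to turn the entered words into the target words."""
    ref, hyp = target.split(), entered.split()
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, 1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, 1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # word missing from entered text
                            curr[j - 1] + 1,      # extra word in entered text
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


# Hypothetical trial: sentence 5 entered in 40 seconds.
print(words_per_minute("the world is a stage", 40.0))  # about 5.7
print(word_errors("peek out the window", "peak out the window"))  # one substituted word
```

Counting at the word level suits a speech recogniser, which tends to get whole words wrong rather than single characters; a character-level distance would be the natural alternative for the keyboard conditions.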

Virtual keyboard: 6.3 WPM (chart of WPM per sentence, S1-S6, for Users 1-7)

Kinect Keyboard: 1.83 WPM (chart of WPM per sentence, S1-S6, for Users 1-7)

Speech-only: 11 WPM (chart of WPM per sentence, S1-S6, for Users 1-7)

SpeeG: 5.8 WPM (chart of WPM per sentence, S1-S6, for Users 1-7)

SpeeG (same chart, annotated): 2.6 and 7.8 WPM

Mean WPM per sentence and input device (chart of WPM per sentence, S1-S6, for Controller, Speech only, Kinect only and SpeeG)

Errors per sentence and input device (chart of mean number of errors per sentence, S1-S6, for Controller, Speech only, Kinect only and SpeeG)

Future work

- Other visualisations
- Smaller gestures
- Dedicated commands (gesture / voice)

Kinect

- Controller-free text input
- Real-time correction
- Dasher, zoomable interface
  - probabilities
  - alphabetic order
  - character-level

Speech

- Non-native speakers
- Untrained voice recogniser
- 6-12 WPM
- Perceived fastest
- Game-like character
- Novice and experts

Special thanks to Jorn De Baerdenmaeker and Keith Vertaenen