
University of Music and Performing Arts Graz, Institute of Electronic Music and Acoustics

The Kinect sensor as human-machine-interface in audio-visual art projects

Matthias Kronlachner (Student), IOhannes m zmölnig

[email protected] [email protected]


Introduction

For several years now, the entertainment and gaming industry has been providing multifunctional and cheap human interface devices which can be used for artistic applications. Since November 2010, a sensor called Kinect™ for Microsoft's Xbox 360 has been available. This input device serves as a color camera and a 4-channel microphone array and provides, as an industry first, a depth image camera at an affordable price. Soon after the release of the Xbox-only device, programmers from all over the world provided solutions for accessing its data from a multitude of operating systems running on ordinary computers.

Operating-system-independent ways are presented to access and interpret Kinect data streams within Pure Data/Gem [a]. Overviews of case studies for interactive audio and visual art projects are given. All developed software is open source and runs under Linux, Mac OS X and Windows.

[a] Pure Data is a visual programming environment used for computer music and interactive multimedia applications. Gem stands for Graphics Environment for Multimedia and extends Pure Data to do real-time OpenGL-based visualizations.

Kinect streams in Pure Data/Gem using pix_freenect, pix_depth2rgba, pix_threshold_depth

[Pd patch excerpt: a single pix_freenect 0 1 1 object delivers the RGB and raw depth streams to separate gemhead chains; pix_depth2rgba maps the raw depth onto a color gradient, pix_threshold_depth 300 1200 keeps only pixels between the two distance thresholds for background subtraction, and each resulting stream is drawn with pix_texture onto a rectangle 2 1.5.]

Fig. 1: RGB stream, raw depth stream, color gradient mapping, background subtraction
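The background subtraction in Fig. 1 boils down to keeping only depth values inside a near/far window. The following numpy sketch illustrates the idea outside of Pd; the function name and the synthetic frame are illustrative only and not part of the Pd externals.

```python
import numpy as np

def threshold_depth(depth_mm, near=300, far=1200):
    """Keep only depth values between near and far (in mm), zeroing the rest,
    similar in spirit to what pix_threshold_depth 300 1200 does in the patch."""
    depth = np.asarray(depth_mm, dtype=np.uint16)
    mask = (depth >= near) & (depth <= far)
    return np.where(mask, depth, 0), mask

# Illustrative synthetic frame; real frames come from the Kinect driver.
frame = np.random.randint(0, 4096, size=(480, 640), dtype=np.uint16)
foreground, mask = threshold_depth(frame)
```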

Representation of depth data

A solution for handling depth data is proposed using the RGBA or YUV colorspace. For RGBA output, the 16-bit depth data is divided into its upper and lower eight bits, which are stored in the red (R) and green (G) channels. The blue channel (B) carries additional information about the pixel; for example, if a user is present in that specific pixel, the corresponding user-id is set.

  Channel | Content
  --------+-------------------------
  R       | 3/8 MSB of depth
  G       | 8 LSB of depth
  B       | 0, or user-id (OpenNI)
  A       | 255

Depth values are 11 bit or 16 bit; alternatively, YUV422 output (2 bytes per pixel) can carry the depth values directly.
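As a rough illustration of this packing scheme (a sketch, not the source of the pix_freenect/pix_openni externals), a 16-bit depth frame can be split into MSB/LSB planes like this:

```python
import numpy as np

def depth_to_rgba(depth_mm, user_ids=None):
    """Pack a 16-bit depth frame into RGBA:
    R = upper 8 bits, G = lower 8 bits, B = user-id (or 0), A = 255."""
    depth = np.asarray(depth_mm, dtype=np.uint16)
    rgba = np.zeros(depth.shape + (4,), dtype=np.uint8)
    rgba[..., 0] = (depth >> 8).astype(np.uint8)      # 8 MSB
    rgba[..., 1] = (depth & 0xFF).astype(np.uint8)    # 8 LSB
    if user_ids is not None:
        rgba[..., 2] = user_ids                       # per-pixel user-id (OpenNI)
    rgba[..., 3] = 255                                # opaque alpha
    return rgba

def rgba_to_depth(rgba):
    """Recover the 16-bit depth values from a packed RGBA image."""
    return (rgba[..., 0].astype(np.uint16) << 8) | rgba[..., 1]
```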

For development and visualization purposes, pix_depth2rgba is used to map distance values onto a color gradient (Fig. 2).

Fig. 2: Gradient for displaying depth data - near to far
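A minimal sketch of such a near-to-far gradient mapping is shown below. The red-green-blue ramp and the 7000 mm range are taken from the patch in Fig. 1; the exact gradient used by pix_depth2rgba may differ.

```python
import numpy as np

def depth_to_gradient(depth_mm, max_dist=7000):
    """Map depth values onto a simple near-to-far color gradient (red -> green -> blue)."""
    t = np.clip(np.asarray(depth_mm, dtype=np.float32) / max_dist, 0.0, 1.0)
    rgb = np.zeros(t.shape + (3,), dtype=np.uint8)
    rgb[..., 0] = (255 * np.clip(1.0 - 2.0 * t, 0.0, 1.0)).astype(np.uint8)  # near: red
    rgb[..., 1] = (255 * (1.0 - np.abs(2.0 * t - 1.0))).astype(np.uint8)     # mid: green
    rgb[..., 2] = (255 * np.clip(2.0 * t - 1.0, 0.0, 1.0)).astype(np.uint8)  # far: blue
    return rgb
```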

Various methods using CPU or GPU processing power can be used to filter out certain regions of the depth image and use them for tracking or as a projection stencil.

Skeleton tracking with pix_openni

pix_openni gives access to higher-level functions such as multiple-user, hand and skeleton tracking (Fig. 3). The output rate depends on the framerate of the depth sensor, approximately 30 Hz.

[Pd patch excerpt: pix_openni 0 1 1 1 0 fed by two gemhead chains, with a pd draw-images subpatch for the image streams and a pd DRAW-SKELETON subpatch driven by the tracking and info output.]

Fig. 3: Displaying the skeleton retrieved by pix_openni


vertimas - übersetzen

The piece übersetzen - vertimas for dancer, sound and projection uses the Kinect sensor to translate body movements on stage into sound, turning the dancer's body into a hyperinstrument. Additionally, the depth video stream is used to extract the outline of the dancer and project it back onto her body in realtime (Fig. 4).

To this end, a data-flow filtering and analysis library has been developed, providing quickly adjustable methods for translating tracking data into sound or visual control data (Fig. 5).

Fig. 4: stage setup vertimas - übersetzen

[Pd patch excerpt: motion/rcv_2joints user1 r_foot torso 1 /skeleton/joint retrieves the r_foot and torso joints; motion/comp-subtract2 computes the movement relative to the body; motion/median-3 applies a 3-point median filter; motion/change-3d takes the difference to the last coordinates and motion/comp-3d-length its vector length; motion/gate_thresh 1 1 truncates small values and filters zeros; motion/attack-release 100 1000 applies an AR envelope; the result is scaled (pd fit-range, / 2.1) and sent to /sampler/15/player/15/gain and /sampler/15/player/15/speed.]

Fig. 5: translating skeleton data to control data
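A rough Python analogue of the processing chain in Fig. 5 is sketched below. The motion/* abstractions themselves are Pd objects; the class, method names and coefficient values here are illustrative only, intended to show the median filtering, differencing, gating and attack-release smoothing steps.

```python
import numpy as np
from collections import deque

class MotionToControl:
    """Sketch of the chain in Fig. 5: movement relative to the body,
    3-point median filter, difference to the last coordinates, vector
    length, noise gate, attack-release envelope."""

    def __init__(self, gate=0.01, attack=0.3, release=0.05):
        self.history = deque(maxlen=3)   # buffer for the 3-point median filter
        self.last = None                 # previous filtered joint position
        self.env = 0.0                   # attack-release envelope state
        self.gate = gate                 # threshold for truncating small values
        self.attack = attack             # per-frame smoothing coefficients (assumed)
        self.release = release

    def step(self, joint, torso):
        # movement relative to the body
        rel = np.asarray(joint, dtype=float) - np.asarray(torso, dtype=float)

        # 3-point median filter per coordinate
        self.history.append(rel)
        filt = np.median(np.array(self.history), axis=0)

        # difference to the last coordinates and its vector length
        if self.last is None:
            self.last = filt
        speed = float(np.linalg.norm(filt - self.last))
        self.last = filt

        # truncate small values (noise gate)
        if speed < self.gate:
            speed = 0.0

        # simple attack-release envelope follower
        coeff = self.attack if speed > self.env else self.release
        self.env += coeff * (speed - self.env)
        return self.env
```

Called once per skeleton frame (about 30 Hz), the envelope output can then be mapped to control parameters such as sampler gain or playback speed, as in the patch.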

IEM Computermusic Ensemble

ICE is a group of electronic musicians, each playing with a notebook and individual controllers. The goal is to play contemporary music, adapted or written for computer music ensembles.

In March 2012, a network concert between Graz and Vilnius took place. One Kinect sensor was used to track the skeletons of two musicians. The tracking data allowed each musician to play his virtual instrument without handheld controllers. Additionally, the Kinect video stream showing the stage in Vilnius was sent to Graz and projected onto a canvas for the remote audience (Fig. 6).

[Diagram, Vilnius stage setup: two musicians playing with one Kinect; Kinect and video server; instrument/audio server with 4 speakers (1st-order Ambisonics); tracking data exchanged via OSC; RGB+depth stream; audience camera; screen; Internet connection (GÉANT, ACOnet, LITNET) to/from Graz; audience & stage Graz.]

Fig. 6: ICE stage setup Vilnius
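The tracking data in this setup travels as OSC messages between the machines. A minimal sketch of sending joint coordinates to a remote audio server with the python-osc package is given below; the host name, port and message layout are assumptions for illustration, and the concert itself used Pd patches rather than this script.

```python
from pythonosc.udp_client import SimpleUDPClient

# hypothetical address of the remote instrument/audio server
client = SimpleUDPClient("instrument-server.example.org", 9000)

def send_joint(user_id, joint_name, x, y, z):
    # one OSC message per tracked joint, roughly 30 times per second
    client.send_message("/skeleton/joint", [joint_name, user_id, x, y, z])

send_joint(1, "r_foot", 0.12, -0.80, 2.31)
```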

Head pose estimation

Based on a paper and software by Gabriele Fanelli et al. [2], the external pix_head_pose_estimation has been developed. It takes the depth map of the Kinect as input and estimates the Euler angles and positions of multiple detected heads in the depth map. The head tracking can be used to rotate an Ambisonics soundfield for monitoring 3D sound environments via headphones (Fig. 7).

[Diagram: Ambisonics input -> rotation (controlled by Kinect head pose estimation) -> Ambisonics binaural decoder -> headphones (L/R).]

Fig. 7: rotating sound field according to head movements
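For first-order B-format (W, X, Y, Z), compensating a head yaw amounts to rotating the X/Y components about the vertical axis while W and Z stay unchanged. The numpy sketch below shows this standard rotation; the sign convention and channel ordering are assumptions, not taken from the poster's implementation.

```python
import numpy as np

def rotate_bformat_yaw(w, x, y, z, yaw_rad):
    """Rotate a first-order Ambisonics (B-format) signal about the vertical axis.
    To keep the rendered scene stable, the field is counter-rotated against the
    listener's head yaw reported by the head pose estimation."""
    c, s = np.cos(-yaw_rad), np.sin(-yaw_rad)   # counter-rotation
    x_rot = c * x - s * y
    y_rot = s * x + c * y
    return w, x_rot, y_rot, z                   # W and Z are unaffected by yaw
```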

References

[1] M. Kronlachner, The Kinect distance sensor as human-machine-interface in audio-visual art projects, project report, Institute of Electronic Music and Acoustics, Graz, 2012.

[2] G. Fanelli, T. Weise, J. Gall, and L. Van Gool, Real Time Head Pose Estimation from Consumer Depth Cameras, DAGM'11.


Examples and source code are available from the homepage: www.matthiaskronlachner.com