3D Gesture Recognition and
Tracking for Next Generation of
Smart Devices
Theories, Concepts, and Implementations
SHAHROUZ YOUSEFI
Department of Media Technology and Interaction Design
School of Computer Science and Communication
KTH Royal Institute of Technology
Doctoral Thesis in Media Technology
Stockholm, February 2014
3D Gesture Recognition and Tracking for Next Generation of Smart Devices: Theories, Concepts, and Implementations
Shahrouz Yousefi
Department of Media Technology and Interaction Design (MID)
School of Computer Science and Communication (CSC)
KTH Royal Institute of Technology
SE-100 44, Stockholm, Sweden
Author’s e-mail: [email protected]
Academic dissertation which, with the permission of KTH Royal Institute of Technology, is presented for public examination for the degree of Doctor of Technology in Media Technology, on Monday, 17 March 2014, at 13:15, in lecture hall F3, Lindstedtsvägen 26, KTH Royal Institute of Technology, Stockholm.
TRITA-CSC-A-2014-02
ISSN 1653-5723
ISRN KTH/CSC/A-14/02-SE
ISBN 978-91-7595-031-0
Copyright © 2014 by Shahrouz Yousefi. All rights reserved.
Typeset in LaTeX by Shahrouz Yousefi
E-version available at http://kth.diva-portal.org
Printed by E-print AB, Stockholm, Sweden, 2014
Distributor: KTH School of Computer Science and Communication
Abstract
The rapid development of mobile devices during the recent decade has been greatly driven by interaction and visualization technologies. Although touchscreens have significantly enhanced interaction technology, it is foreseeable that with future mobile devices, e.g., augmented reality glasses and smart watches, users will demand more intuitive inputs such as free-hand interaction in 3D space. In particular, manipulation of digital content in augmented environments will require 3D hand/body gestures. Therefore, 3D gesture recognition and tracking are highly desired features for interaction design in future smart environments. Due to the complexity of hand/body motions and the limited computational resources of mobile devices, 3D gesture analysis remains an extremely difficult problem to solve.
This thesis aims to introduce new concepts, theories, and technologies for natural and intuitive interaction in future augmented environments. The contributions of this thesis support the concept of bare-hand 3D gestural interaction and interactive visualization on future smart devices. The introduced technical solutions enable effective interaction in the 3D space around the smart device. Accurate and robust 3D motion analysis of hand/body gestures is performed to facilitate 3D interaction in various application scenarios. The proposed technologies enable users to control, manipulate, and organize digital content in 3D space.
Keywords: 3D gestural interaction, gesture recognition, gesture tracking,
3D visualization, 3D motion analysis, augmented environments.
Shahrouz Yousefi
February 2014
Sammanfattning (Abstract in Swedish)
The rapid development of mobile devices during the past decade has largely been driven by interaction and visualization technologies. Although touchscreens have considerably improved interaction technology, it is foreseeable that with future mobile devices, e.g., augmented reality glasses and smart watches, users will demand more intuitive ways to interact, such as free-hand interaction in 3D space. This becomes especially important for the manipulation of digital content in augmented environments, where 3D hand/body gestures will be essential. Therefore, 3D gesture recognition and tracking are highly desired capabilities for interaction design in future smart environments. Due to the complexity of hand/body motions, and the limitations of mobile devices for expensive computations, 3D gesture analysis remains a very difficult problem to solve.
The thesis aims to introduce new concepts, theories, and techniques for natural and intuitive interaction in future augmented environments. The contributions of this thesis support the concept of bare-hand 3D gestural interaction and interactive visualization on future smart devices. The introduced technical solutions enable effective interaction in the 3D space around the smart device. Accurate and robust 3D motion analysis of hand/body gestures is performed to facilitate 3D interaction in various application scenarios. The proposed techniques enable users to control, manipulate, and organize digital content in 3D space.
Keywords: 3D gestural interaction, gesture recognition, gesture tracking, 3D visualization, 3D motion analysis, augmented environments.
Shahrouz Yousefi
February 2014
Acknowledgements
First of all, I wish to express my sincere gratitude to my main advisor, Prof. Haibo Li, for providing me with this research opportunity.
Thank you for your motivation, enthusiasm, and support during
these years. Without your supervision and mentoring this thesis
would not have been possible. You inspired me to be more adven-
turous in research.
I would like to thank my second advisor, Dr. Li Liu, for all the
motivational and fruitful discussions. Special thanks to my dear
friend and colleague, Farid Kondori. We had many collaborations,
interesting discussions and enjoyable moments during these years.
I would like to thank my former colleagues at the Digital Media Lab, Umeå University, for their helpful suggestions and comments on my research projects. Special thanks to Annemaj Nilsson, Mona-Lisa Gunnarsson, and the friendly staff of the Department of Applied Physics and Electronics, Umeå University.
My time at KTH was really enjoyable thanks to the friendly colleagues of the Department of Media Technology and Interaction Design. I am grateful for the time spent with them at work meetings, seminars, and social events. I must especially thank Prof. Ann Lantz for providing an excellent research environment at the MID department. Thanks for your support, encouragement, and kindness. I would also like to thank Henrik Artman, Cristian Bogdan, Ambjörn Naeve, Olle Bälter, Eva-Lotta Sallnäs, and the other senior researchers at the MID department for their support and guidance.
Many thanks should go to Dr. Roberto Bresin and Prof. Yngve Sundblad for reviewing my thesis. Your constructive ideas, insightful comments, and suggestions greatly improved the quality of my PhD thesis.
Winning the first prize in the KTH Innovation Idea Competition, the best project work award in the Uminova Academic Business Challenge, and being selected as one of the top PhD works at the ACM Multimedia Doctoral Symposium motivated me to work harder on the development of my research ideas. I would especially like to thank Håkan Borg and Cecilia Sandell from KTH Innovation for their great support with the patentability analysis, business development, and commercialization of my research results.
Finally, and most importantly, I am grateful to my loving parents, my brother, and his family for giving me endless intellectual support and encouragement to pursue my studies during these years. I would especially like to thank my best friend and companion, Shora. Thanks for the wonderful and precious moments we shared together.
Shahrouz Yousefi
February 2014
Contents
Contents v
1 Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Research Problem . . . . . . . . . . . . . . . . . . . . . . . . . 3
1.2.1 Future Mobile Devices . . . . . . . . . . . . . . . . . . . 4
1.2.2 Experience Design . . . . . . . . . . . . . . . . . . . . . 4
1.2.3 Limitations in Interaction Facilities . . . . . . . . . . . . 6
1.2.4 Limitations in Visualization . . . . . . . . . . . . . . . . 8
1.2.5 Technical Challenges in 3D Gestural Interaction . . . . 8
1.3 Future Trends in Multimedia Context . . . . . . . . . . . . . . 10
1.3.1 3D Interaction Technology . . . . . . . . . . . . . . . . . 10
1.3.2 3D Visualization . . . . . . . . . . . . . . . . . . . . . . 10
1.3.3 Passive Vision to Active/Interactive Vision . . . . . . . 10
1.3.4 Gesture Analysis: from Computer Vision Methods to Image-based Search Methods . . . . . . 11
1.4 Research Strategy . . . . . . . . . . . . . . . . . . . . . . . . . 11
2 Related Work 13
2.1 Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 13
2.2 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
2.2.1 3D Motion Capture Technologies in Available Interactive Systems . . . . . . 15
2.2.1.1 Passive Motion Tracking and Its Applications 15
2.2.1.2 Active Motion Tracking and Its Applications . 16
2.2.1.3 Comparison Between Active and Passive Methods . . . . . . 16
2.2.2 3D Motion Estimation for Mobile Interaction . . . . . . 18
2.2.3 3D Gesture Recognition and Tracking . . . . . . . . . . 19
2.2.4 3D Visualization on Mobile Devices . . . . . . . . . . . 21
3 General Concept and Methodology 23
3.1 General Concept . . . . . . . . . . . . . . . . . . . . . . . . . . 23
3.1.1 Interaction/Visualization Space . . . . . . . . . . . . . . 25
3.1.2 Sharing the Interaction/Visualization space . . . . . . . 27
3.1.2.1 Single-user, Single-device . . . . . . . . . . . . 27
3.1.2.2 Multi-user, Multi-device with Shared Interaction Space . . . . . . 28
3.1.2.3 Multi-user, Single-device with Shared Visualization Space . . . . . . 28
3.1.2.4 Interaction from Different Locations for Multi-user Multi-device . . . . . . 28
3.2 Evolution of Interaction/Visualization Spaces . . . . . . . . . . 28
3.3 Enabling Media Technologies . . . . . . . . . . . . . . . . . . . 31
3.3.1 Vision-based Motion Tracking in 3D Space . . . . . . . 32
3.3.2 3D Visualization . . . . . . . . . . . . . . . . . . . . . . 33
3.4 Methodology Overview . . . . . . . . . . . . . . . . . . . . . . . 36
3.5 Gesture Analysis through the Pattern Recognition Methods . . . . . . 38
3.6 Gesture Analysis through the Large-scale Image Retrieval . . . . . . 40
4 Enabling Media Technologies 43
4.1 Gesture Detection and Tracking Based on Low-level Pattern Recognition . . . . . . 44
4.1.1 3D Motion Analysis . . . . . . . . . . . . . . . . . . . . 46
4.2 Gesture Detection and Tracking Based on Gesture Search Engine . . . . . . 48
4.2.1 Providing the Database of Gesture Images . . . . . . . . 49
4.2.2 Query Processing and Matching . . . . . . . . . . . . . 50
4.2.3 Scoring System . . . . . . . . . . . . . . . . . . . . . . . 50
4.2.4 Quality of Hand Gesture Database . . . . . . . . . . . . 52
4.3 Interactive 3D Visualization . . . . . . . . . . . . . . . . . . . . 54
4.4 Methods for 3D Visualization . . . . . . . . . . . . . . . . . . . 56
4.4.1 Depth Recovery and 3D Visualization from a Single View 56
4.4.2 3D Visualization from Multiple 2D Views . . . . . . . . 57
4.5 3D Channel Coding . . . . . . . . . . . . . . . . . . . . . . . . 57
5 Experimental Results 59
5.1 Experiments on Gesture Detection, Tracking and 3D Motion Analysis . . . . . . 59
5.1.1 Camera and Experiment Condition . . . . . . . . . . . . 59
5.1.2 Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.1.3 Programming Environment and Results . . . . . . . . . 62
5.2 Experiments on Gesture Search Framework . . . . . . . . . . . 63
5.2.1 Constructing the Database . . . . . . . . . . . . . . . . 63
5.2.2 Forming the Vocabulary Table . . . . . . . . . . . . . . 65
5.2.3 Gesture Search Engine and Neighborhood Analysis . . . 66
5.2.4 Gesture Search Results . . . . . . . . . . . . . . . . . . 66
5.3 Technical Comparison between the Prior Art and the Proposed Solutions . . . . . . 68
5.4 3D Rendering and Graphical Interface . . . . . . . . . . . . . . 69
5.5 Research Scenarios . . . . . . . . . . . . . . . . . . . . . . . . . 71
5.5.1 Implementation of the 3D Gestural Interaction on Mobile Platform . . . . . . 71
5.5.2 Implementation of the Interactive 3D Vision on a Wall-sized Display . . . . . . 72
5.5.3 3D Rendering and Visualization of 2D Content . . . . . 73
5.6 Potential Applications . . . . . . . . . . . . . . . . . . . . . . . 74
5.6.1 3D Photo Browsing . . . . . . . . . . . . . . . . . . . . 75
5.6.2 Virtual/Augmented Reality . . . . . . . . . . . . . . . . 75
5.6.3 Interactive 3D Display . . . . . . . . . . . . . . . . . . . 76
5.6.4 Medical Applications . . . . . . . . . . . . . . . . . . . . 76
5.6.5 3D Games . . . . . . . . . . . . . . . . . . . . . . . . . . 76
5.6.6 3D Modeling and Reconstruction . . . . . . . . . . . . . 77
5.6.7 Wearable AR Displays . . . . . . . . . . . . . . . . . . . 77
5.7 Usability Analysis in Object Manipulation: Touchscreen Interaction vs. 3D Gestural Interaction . . . . . . 77
5.7.1 User Test . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.7.2 Usability Results . . . . . . . . . . . . . . . . . . . . . . 80
6 Concluding Remarks and Future Direction 83
6.1 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . . 83
6.1.1 Conceptual Models for Future Human Mobile Device Interaction . . . . . . 84
6.1.2 Technical Contributions for 3D Gestural Interaction and 3D Interactive Visualization . . . . . . 84
6.1.3 Implementations . . . . . . . . . . . . . . . . . . . . . . 85
6.2 Concluding Remarks and Future Direction . . . . . . . . . . . . 86
6.2.1 Technical Challenges . . . . . . . . . . . . . . . . . . . . 88
6.2.1.1 Active vs. Passive Motion Capture . . . . . . . 88
6.2.1.2 Gesture Detection and Tracking without Intelligence . . . . . . 89
6.2.1.3 Adaptability of the Contributions to Future Hardware Evolution . . . . . . 89
6.2.1.4 Contributions of other Research Areas to Computer Vision . . . . . . 90
6.2.2 Further Development . . . . . . . . . . . . . . . . . . . . 90
6.2.2.1 Concept of Collaborative 3D Interaction . . . . 91
6.2.2.2 Concept of Interaction in the Space using Body Gestures . . . . . . 91
6.2.2.3 Extension of the Gesture Search Framework to Extremely Large Scale . . . . . . 91
6.2.3 Future of Mobile Interaction and Visualization . . . . . 92
7 Summary of the Selected Articles 95
7.1 List of Publications . . . . . . . . . . . . . . . . . . . . . . . . . 97
8 Paper I: Experiencing Real 3D Gestural Interaction with Mobile Devices 105
8.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
8.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
8.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 108
8.4 System Description . . . . . . . . . . . . . . . . . . . . . . . . . 110
8.4.1 Gesture Detection and Tracking . . . . . . . . . . . . . 111
8.4.2 Local Orientation and Double-angle Representation . . 111
8.4.3 Rotational Symmetries Detection . . . . . . . . . . . . . 113
8.4.4 3D Structure from Motion . . . . . . . . . . . . . . . . . 115
8.4.5 Finger Detection and Tracking . . . . . . . . . . . . . . 116
8.4.5.1 Fingertip Detection . . . . . . . . . . . . . . . 117
8.4.5.2 Localization by Clustering . . . . . . . . . . . 118
8.4.5.3 Finger Tracking . . . . . . . . . . . . . . . . . 119
8.4.6 3D Coding and Visualization . . . . . . . . . . . . . . . 119
8.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 121
8.6 Usability of the Proposed System . . . . . . . . . . . . . . . . . 124
8.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
9 Paper II: 3D Photo Browsing for Future Mobile Devices 133
9.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
9.2 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
9.3 Challenges . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 135
9.4 Enabling Media Technologies . . . . . . . . . . . . . . . . . . . 137
9.4.1 Vision-based Motion Tracking in 3D Space . . . . . . . 137
9.4.2 3D Visualization . . . . . . . . . . . . . . . . . . . . . . 138
9.5 Design of the 3D Photo Browser . . . . . . . . . . . . . . . . . 139
9.6 Technical Contributions . . . . . . . . . . . . . . . . . . . . . . 142
9.6.1 Gesture Detection and Tracking . . . . . . . . . . . . . 142
9.6.2 3D Motion Analysis . . . . . . . . . . . . . . . . . . . . 143
9.6.3 Methods for 3D Visualization . . . . . . . . . . . . . . . 143
9.7 Concluding Remarks . . . . . . . . . . . . . . . . . . . . . . . . 144
9.8 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 144
10 Paper III: Bare-hand Gesture Recognition and Tracking through the Large-scale Image Retrieval 147
10.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
10.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
10.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 150
10.4 System Description . . . . . . . . . . . . . . . . . . . . . . . . . 152
10.4.1 Pre-processing on the Database . . . . . . . . . . . . . . 152
10.4.1.1 Position/Orientation Tagging to the Database 152
10.4.1.2 Defining and Filling the Edge-orientation Table 155
10.4.2 Query Processing and Matching . . . . . . . . . . . . . 156
10.4.2.1 Direct Scoring . . . . . . . . . . . . . . . . . . 156
10.4.2.2 Reverse Scoring . . . . . . . . . . . . . . . . . 159
10.4.2.3 Weighting the Second Level Top Matches . . . 160
10.4.2.4 Dimensionality Reduction for Motion Path Analysis . . . . . . 160
10.4.2.5 Motion Averaging . . . . . . . . . . . . . . . . 161
10.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 162
10.5.1 Dimensionality Reduction for Selective Search . . . . . . 166
10.6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . 167
11 Paper IV: Interactive 3D Visualization on a 4K Wall-Sized Display 173
11.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
11.2 Introduction and Related Work . . . . . . . . . . . . . . . . . . 174
11.3 3D Motion Analysis . . . . . . . . . . . . . . . . . . . . . . . . 178
11.4 Error Analysis in 3D Motion Estimation . . . . . . . . . . . . . 180
11.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 182
11.5.1 Visualization on a 4K Wall-sized Display . . . . . . . . 183
11.6 Conclusion and Future Work . . . . . . . . . . . . . . . . . . . 183
12 Paper V: 3D Visualization of Single Images Using Patch Level Depth 187
12.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
12.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 188
12.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 189
12.4 Monocular Features for Depth Estimation . . . . . . . . . . . . 190
12.5 Feature Vector . . . . . . . . . . . . . . . . . . . . . . . . . . . 191
12.6 MRF and Depth Map Recovery . . . . . . . . . . . . . . . . . . 193
12.7 Depth Normalization and Pixel Level Translation . . . . . . . . 194
12.8 Anaglyph 3D Coding . . . . . . . . . . . . . . . . . . . . . . . . 195
12.9 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 196
12.10 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . 198
13 Paper VI: Stereoscopic Visualization of Monocular Images in Photo Collections 201
13.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
13.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 202
13.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 203
13.4 System Description . . . . . . . . . . . . . . . . . . . . . . . . . 204
13.4.1 SIFT Feature Detection and Matching . . . . . . . . . 204
13.4.2 Image Transformation . . . . . . . . . . . . . . . . . . . 205
13.4.3 Image Projection and Stereoscopic Adjustment . . . . . 206
13.4.4 3D Coding and Visualization . . . . . . . . . . . . . . . 207
13.5 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 208
13.6 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 210
14 Paper VII: Robust Correction of 3D Geo-Metadata in Photo Collections by Forming a Photo Grid 213
14.1 Abstract . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
14.2 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 214
14.3 Related Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 216
14.4 System Overview . . . . . . . . . . . . . . . . . . . . . . . . . . 216
14.5 System Description . . . . . . . . . . . . . . . . . . . . . . . . . 218
14.5.1 Pre-processing . . . . . . . . . . . . . . . . . . . . . . . 218
14.5.2 Structure from Motion . . . . . . . . . . . . . . . . . . . 219
14.5.3 Uncertainty Analysis . . . . . . . . . . . . . . . . . . . . 221
14.5.4 Signal Model . . . . . . . . . . . . . . . . . . . . . . . . 223
14.5.5 Measurement Model . . . . . . . . . . . . . . . . . . . . 224
14.5.6 Data Fusion . . . . . . . . . . . . . . . . . . . . . . . . . 225
14.6 Experimental Results . . . . . . . . . . . . . . . . . . . . . . . . 225
14.7 Discussion and Conclusion . . . . . . . . . . . . . . . . . . . . . 226
Bibliography 235
Chapter 1
Introduction
1.1 Motivation
Mobile devices play an important role in the modern world. Beyond their ordinary use in daily life, people use them for various advanced purposes in science, entertainment, education, medical applications, communication, gaming, etc. The fast-growing market reveals that sales of mobile devices are overtaking those of PCs. Recent statistics on the mobile device market indicate that total smartphone sales reached 490 million units in 2011 and 700 million units in 2012 worldwide [1, 2]. At the current rate of growth, smartphone sales will exceed 1.5 billion units in 2017 [3]. In addition to this enormous number, we should take into account other types of mobile devices such as tablets, advanced portable gaming devices, digital cameras, camcorders, multimedia players, smart watches, and augmented reality glasses.
The capability of mobile devices to capture, store, process, and visualize multimedia content has increased significantly in recent years. In addition to high-resolution cameras, other embedded sensors such as GPS, accelerometers, gyroscopes, and magnetometers make it possible to collect extra metadata and integrate them into a wide range of application scenarios. Moreover, this variety of sensors might be considered as alternative input facilities for interaction between users and their mobile devices.
The introduction of smartphones has changed the way we interact with mobile phones. Nowadays, people interact with their mobile devices through touchscreen displays. The current technology offers single- or multi-touch gestural interaction on 2D touchscreens. This approach is designed to provide a more natural interaction when users operate their mobile devices. On the touchscreen of a smartphone, users can manipulate a soft keyboard and virtual objects, and perform actions just by moving their fingers. Although this technology has solved many limitations in human mobile device interaction, the recent trend in the digital world reveals that people always prefer intuitive experiences with their digital devices. For instance, the popularity of the Microsoft Kinect demonstrates that people enjoy experiences that give them the freedom to act as they would in the real world.
The rapid development and wide adoption of smartphones have greatly changed our lives. Nowadays, we rely more and more on our smartphones, and there is a strong trend for the smartphone to become a part of our body. An indicative example is Google Glass, which can be seen as a version of the next generation of smartphones. Most probably, with next generation smartphones, users will no longer be satisfied with performing interaction on a 2D touchscreen; they will demand more natural interactions performed with the bare hands in 3D free space, at the back of the phone or in front of the smart device, for instance. Thus, the next generation of smart devices will need a gesture interface that lets the bare hands manipulate digital objects directly, for instance, playing Spotify, scanning photo collections, and reading emails.
Given these strong indications and current trends, mobile devices will be an essential, inseparable part of our body in the near future. In fact, smartphones, tablets, and wearable augmented reality glasses will not be just ordinary devices. They will bring any experience from the huge sea of information into a personalized visualization space. For instance, a mobile device might serve as a guitar, fitness trainer, home theater, shopping center, navigation system, game console, or thousands of other possible scenarios.
Currently, the major discussion is how we interact with the mobile device, while in the near future we should also consider how we interact through the mobile device with the physical space, objects, information, etc. When we discuss the next generation of mobile devices, we should consider the next generation of interaction facilities too. The important question is: in which space, and how, will we interact with and through our future mobile devices?
1.2 Research Problem
Designing the interaction experience for future mobile devices involves many challenging problems. The rapid growth of mobile device technology shows that in the near future we will have extremely powerful handheld and wearable devices. Although it is rather hard to predict exactly the hardware capabilities and features of future mobile devices, current trends in multimedia technology indicate that interaction with future devices should happen in a more intuitive and natural manner. Here, some important scientific questions arise. First, in which space, and how, should this intuitive interaction happen? Will touchscreens and track pads be replaced by other input facilities? And how should we design a new space for intuitive interaction with future mobile devices?
Intuitive interaction is closely related to the mental connection of humans to their natural experiences. Since humans interact with their environment through physical gestures, 3D hand/body gestures might be effective alternatives to existing interaction facilities.
Assume that the new interaction space is designed and introduced. The main challenge then is how to support this concept technically. What types of technologies are required to realize this significant change? What are the limitations of media technologies in supporting 3D gestural interaction? How can we detect, recognize, and track complex hand gestures, head motions, and body movements in 3D space? And how can enabling media technologies solve these technical problems?
From both design and technical perspectives, introducing new ways of interacting with future mobile devices is an extremely challenging task. This thesis aims to tackle these challenges and introduce new concepts, designs, and technical solutions for the issues mentioned above. These challenges are discussed in detail in the following sections.
1.2.1 Future Mobile Devices
In discussions of future mobile devices we have to consider some important points. Five to ten years from now, we will most probably face substantial changes in mobile technology. From a hardware point of view, future mobile devices will feature more advanced and powerful components such as various types of sensors, high-speed processors, 3D displays, and large memories. User experience in interaction with mobile devices will likely be quite different from today. Therefore, designing any interactive application for future mobile devices needs extensive investigation and research. From a design point of view, the interaction environment and visualization quality will change substantially in the near future. Here, the main challenge is how to design a usable system that enhances the user experience in interaction with future mobile devices.
1.2.2 Experience Design
In the multimedia context, experience is defined as the sensation of interaction with a product, service, or event [4]. Therefore, in experience design for mobile users, the quality and sensation of interaction on physiological, affective, and cognitive levels should be taken into account. In fact, for a more convenient and desirable experience, interaction between user and device should happen in a natural and effective manner. Unlike interaction with the physical world, where people use their body gestures, the best available technologies in smartphones and tablets confine interaction to limited 2D touchscreens. This limitation prevents users from interacting naturally in a wide range of applications where physical gestures for 3D manipulation may be unavoidable. For instance, picking, placing, grabbing, moving, pushing, zooming, and, in general, manipulating virtual menus, objects, and graphics in 3D environments require physical hand gestures. In addition, because the interaction happens on the display, users' fingers or hands in practice cover a large area or some parts of the screen while they operate the device. As a result, they lose visibility of the display during the interaction. Since the hardware capability of mobile devices is increasing rapidly, the complexity of applications will increase as well. This means that in the near future we will interact with our digital devices in a quite different manner. Another important point to consider is visualization quality. For high-quality user perception, realistic visualization is needed; this is the main idea behind the development of 3D display technologies such as 3D cinemas and TVs. It is foreseeable that in the future, multimedia content will be displayed in 3D format. Therefore, adaptation of old content to future visualization technologies should be considered. For instance, we need to find an effective way to convert our old multimedia collections, such as 2D photo albums and videos, to 3D, which is quite a challenging problem. In short, experience design for future mobile devices is a difficult task from both interaction and visualization perspectives.
Quality of user experience is a rather difficult concept to define, measure, and evaluate. Although substantial research has been done on this subject, finding a straightforward method to measure the quality of experience (QoE) is still challenging. Usability is an important criterion to consider when we investigate the QoE concept. Usability might be perceived from three angles: efficiency, effectiveness, and user satisfaction [5]. From a technical point of view, these three factors have been found to be more practical to measure and evaluate. Therefore, improving the usability factors might significantly enhance the quality of user experience in the multimedia context.
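As one concrete way these three factors are often quantified in user tests, consider the following sketch (illustrative conventions only; these formulas are not prescribed by [5] or by this thesis):

    def usability_summary(completion_times_s, tasks_completed, tasks_total,
                          satisfaction_scores):
        """Aggregate the three usability factors from a user test.

        Efficiency ~ mean task completion time, effectiveness ~ task
        completion rate, satisfaction ~ mean questionnaire score
        (e.g., on a 1-5 Likert scale).
        """
        return {
            "mean_time_s": sum(completion_times_s) / len(completion_times_s),
            "completion_rate": tasks_completed / tasks_total,
            "mean_satisfaction": sum(satisfaction_scores) / len(satisfaction_scores),
        }

Lower mean task time, a higher completion rate, and higher satisfaction would then indicate better usability and, by the argument above, a better quality of experience.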
1.2.3 Limitations in Interaction Facilities
Designing interactive applications for mobile devices is still a challenging problem. Although new devices are quite powerful from a processing point of view, many problems remain unsolved due to the limitations in size and weight imposed by portability. One major problem is how users can effectively communicate with their devices at the hardware level. The current technology provides several solutions to this problem. The commonly used hardware facilities for communicating with mobile devices are miniature keyboards, tiny joysticks, and touchscreen displays [6].
Keyboards allow users to perform tasks through menus, type, search, navigate, etc., but in reality even a small keyboard occupies a large space and limits the area available for the display. Moreover, the usability of such keyboards is questionable for users with large fingers, who struggle to select tiny buttons. Substantial research has been done on reducing the size of keyboards, for example simplifying the devices by mapping the QWERTY keyboard to other formats or using a few buttons or a single button [7, 8]. Swype is another well-known technique for enhancing interaction with a mobile device through the virtual keyboard. With the Swype virtual keyboard, the user enters a word by sliding a finger from its first letter to its last letter, lifting the finger only between words.
Although joysticks are useful in some applications for scrolling up and down and selecting menus, they are very limited and difficult to work with, especially on small screens. Nowadays, touchscreen displays are used by most smartphones and tablet PCs, and the trend shows that button-less devices are becoming more popular. This indicates that users prefer to work on larger screens, and designers allocate the device's whole surface to the touchscreen display. On the other hand, touchscreen displays have several drawbacks. First, in typing scenarios a virtual keyboard is rendered on the screen, which occupies a large space for the sake of user convenience. Second, in most applications at least one hand works on the surface, which brings the occlusion problem, and in many cases both hands are involved. The occlusion therefore limits the visualization, and the quality of experience is degraded. In some contributions, a novel approach of touching the back of the device is presented [9]. Although this solution might work in limited scenarios, users generally lose the correspondence between visual perception and touch.
The commonly used technology in interaction design for mobile devices is the 2D touchscreen display with a single physical button or no physical button at all. Since humans interact with the physical world in 3D space, the quality of interaction degrades when it is mapped onto 2D surfaces. Technically speaking, motion in 3D space is represented by six degrees of freedom (three rotation parameters and three translation parameters), while in the mapping from the real world to 2D screens, the motion parameters are reduced to two. On single-touch displays, motion is limited to translation in 2D (x and y) coordinates, but newer products on the market use multi-touch gestures to simulate rotation and translation along the z-axis. Even with the best interactive devices on the market, however, the motion parameters remain limited on 2D screens. This means that without extra buttons, 2D gestures, or the aid of embedded orientation sensors, rotation about the x and y axes is not possible on 2D screens. Moreover, since in magnetometer-aided applications the device itself has to move, the visual content such as graphics, photos, and videos may fall out of the user's sight, so this approach is not applicable in most cases.
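To make the degrees-of-freedom argument concrete, a rigid motion of a 3D point can be written in the standard rigid-body form (generic notation, not taken from this thesis):

    \mathbf{X}' = R(\alpha, \beta, \gamma)\,\mathbf{X} + \mathbf{t},
    \qquad \mathbf{t} = (t_x, t_y, t_z)^{\top}

where R is a 3x3 rotation matrix parameterized by three rotation angles and t is a translation vector, six parameters in total, of which a single-touch screen observes only the two screen coordinates (x, y).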
Another important group of mobile devices is the forthcoming augmented reality glasses such as Google Glass. Since this type of gadget might be the future of smartphones, it is quite important to investigate how conveniently users can interact with them. Voice commands are one effective solution for instructing the device to take a set of actions. For instance, for dialing a number, searching for a phrase, or capturing a photo, voice commands can be really useful. For more complex tasks such as writing a text, skimming emails, or browsing photos, users definitely need more input facilities. Google has introduced a touch bar on the side frame of the Glass to solve this problem. Although this small surface provides more capability for user interaction, it is obviously weaker than current smartphone touchscreens due to its size and its invisibility to the user's sight.
In short, designing usable and convenient input facilities for future mobile devices requires deep research and investigation. 3D gestural interaction in free space might be an effective alternative to the current interaction technology, and enabling media technologies can support user device interaction and enhance the user experience.
1.2.4 Limitations in Visualization
In order to improve the user experience in multimedia applications, high-quality and realistic visualization is required in addition to effective interaction. The main reason behind manufacturing mobile devices with larger displays is to enhance the visual output and the quality of experience. Although size and mobility have been traded off in design for many years, the importance of visual interaction has driven mobile devices toward larger screen sizes. Substantial experiments have been done to find the optimal and most effective size for different mobile devices [6]. A mobile device, as its name suggests, should be portable and easy for its user to hold, and this criterion brings the challenging task of keeping the balance between portability and size, besides the power consumption issue, which is outside the scope of this discussion [10]. Today's smartphones offer only a limited surface for visualization. If wearable smart glasses provide high-quality visualization, they might significantly enhance the visual experience. 3D display technology is another feature that might improve the perception quality of future mobile devices.
1.2.5 Technical Challenges in 3D Gestural Interaction
3D gestural interaction is a rather new trend in the multimedia context, and substantial efforts have been made in this area. In particular, 3D gestural interfaces are used in gaming and entertainment applications. One of the enabling technologies for building such gesture interfaces is hand tracking and gesture recognition. The major technology bottleneck lies in the difficulty of capturing and analyzing articulated hand motions. One existing solution is to employ glove-based devices, which directly measure the finger joint angles and spatial positions of the hand using a set of sensors (i.e., electromagnetic or fiber-optical sensors). Although such applications exist in human computer interaction, virtual reality, and 3D games, glove-based solutions are too intrusive, cumbersome, and expensive for natural interaction with mobile devices. To overcome this, vision-based hand motion capture and tracking solutions need to be developed. Capturing hand and finger motions in video sequences is a highly challenging task due to the large number of degrees of freedom (DOF) of the hand kinematics. Tracking articulated objects through sequences of images is one of the grand challenges in computer vision. Recently, Microsoft demonstrated how to capture full-body motions by means of its newly developed depth camera, the Kinect. The question is whether the problem of 3D hand tracking and gesture recognition can be solved by using 3D depth cameras. Of course, this problem has been greatly simplified by the introduction of real-time depth cameras. However, the technologies based on 3D depth information for hand tracking and gesture recognition still face major challenges for mobile applications.
Mobile applications have at least two critical requirements: computational efficiency and robustness. For mobile applications, timely feedback and interaction are assumed; any latency should not be perceived as unnatural by the human participant. Therefore, the maximum time between the completion of a gestural action by a person and the response from the device must be no longer than 100 ms (at least 10 frames per second should be processed in real-time vision-based systems). This requires an extremely fast solution for hand tracking and gesture recognition. It is doubtful whether most existing technical approaches, including the one used in the Kinect body tracking system, would lead to the technical development needed for future mobile devices, due to their inherently resource-intensive nature. Another issue is robustness: solutions for mobile applications should always work, whether indoors or outdoors. This may exclude the possibility of using Kinect-type depth sensors in the next generation of mobile devices. Therefore, we come back to our original problem again: how to solve the problem of hand tracking and gesture recognition with video cameras.
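The 100 ms budget stated above translates directly into a per-frame time constraint. The following minimal sketch illustrates the idea; read_frame and track_gesture are hypothetical placeholders for a camera read and a complete detection/tracking step:

    import time

    FRAME_BUDGET_S = 0.100  # stated requirement: response within 100 ms (>= 10 fps)

    def process_stream(read_frame, track_gesture):
        """Run a tracker frame by frame and flag frames exceeding the budget."""
        while True:
            frame = read_frame()
            if frame is None:  # end of stream
                break
            start = time.perf_counter()
            track_gesture(frame)  # detection + tracking + pose estimation
            elapsed = time.perf_counter() - start
            if elapsed > FRAME_BUDGET_S:
                print(f"frame took {elapsed * 1000:.1f} ms; "
                      "interaction no longer feels real-time")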
A critical question is whether we could develop alternative video-based solutions to hand tracking and gesture recognition that may fit future mobile applications better. This question is clearly significant to address, since it concerns not only one of the fundamental problems in computer vision, but would also have a potential impact on the mobile industry and, above all, on the interaction with mobile devices in the future.
1.3 Future Trends in Multimedia Context
The contributions of this thesis are highly inspired by the following key trends in the multimedia context.
1.3.1 3D Interaction Technology
The major direction of interaction technology is towards more intuitive and natural interaction between users and digital devices. This is the main reason that keyboards, joysticks, and other traditional input facilities have largely been replaced by track pads and touchscreens. The rapid development of new sensors for 3D interaction, such as the Microsoft Kinect and the Leap Motion, is another indication of this trend.
1.3.2 3D Visualization
Visualization quality has improved significantly during the recent decade. The major trend indicates that 2D display devices are being replaced by 3D technology. Realistic perception and quality of experience are important features introduced by 3D display technology.
1.3.3 Passive Vision to Active/Interactive Vision
The introduction of wearable smart displays such as Google Glass reveals a new trend in the multimedia world. Augmented reality glasses and similar products will change user perception from passive to active or interactive vision. In fact, users may interact with the environment through the wearable display, receive information from various channels, and command the display.
1.3.4 Gesture Analysis: from Computer Vision Methods to Image-based Search Methods
Gesture detection, recognition, and tracking are mainly considered classical computer vision and pattern recognition problems. The capability of new devices to store and process large databases motivates the idea of solving these problems using image-based search approaches. Therefore, the development of search methods for visual content might be the future approach to gesture analysis.
1.4 Research Strategy
The main objective of this research is to develop concepts and technologies for effective interaction with future mobile devices. In order to fulfill this objective, challenges from both design and technical perspectives should be considered. This thesis aims to cover the concept of human mobile device interaction, future challenges, and possible new frameworks to improve the quality of user experience. Technical solutions to overcome these challenges are then introduced, and experimental results are demonstrated. The main research strategies towards achieving these goals can be summarized in the following items:
• Concepts of interaction and visualization spaces have been deeply investigated during this work. The idea of extending the interaction and visualization spaces to real 3D space, sharing the interaction/visualization spaces, and potential application scenarios are introduced.
• In order to support the future interaction/visualization concept, enabling media technologies have been deeply studied and widely used during this research.
• 3D gestural interaction is suggested as a powerful tool for future interaction technology. Different methods for gesture detection, recognition, tracking, and 3D motion analysis have been studied, and new algorithms supporting this concept have been developed during this research.
• Concepts of active and passive vision have been investigated and studied during this work. An interactive framework for 3D displays has been introduced. Moreover, new methods for 3D visualization of multimedia content have been developed.
• Various implementations of gestural interaction and 3D visualization, and different experiments, have been conducted on stationary and mobile platforms. Experimental results have been compared and the final conclusions reflected.
• The future direction of mobile multimedia, potential application scenarios, enabling technologies, and new frameworks have been investigated.
Chapter 2
Related Work
2.1 Terminology
Nowadays, gesture-based interaction is a strong trend in the multimedia context. In general, effective 3D gestural interaction can be achieved by combining technical solutions in gesture analysis with usable application design and an efficient user interface. Since the main focus of this thesis is on the technical aspects of gesture analysis, it is crucial to provide a comprehensive definition of the technical keywords and expressions that are used frequently in the discussions.
Gesture recognition: gesture recognition is the process of interpreting human gestures using mathematical models or computer vision algorithms. Gesture recognition is widely considered for communication between users and computers using sign language. Various hand gestures might be used for commanding digital devices in different tasks. In this context, gesture recognition is the process of differentiating between various hand gestures and assigning different labels to them. For instance, all variations of the grab gesture in different poses and orientations should be recognized as the grab gesture.
Gesture detection: the process of detecting the presence of a gesture pattern in an image frame is known as gesture detection. In this context, for a specific hand gesture, the detection output indicates the presence or absence of the gesture pattern in an image frame.
Gesture localization: the process of returning the estimated position of the detected gesture in an image frame is known as gesture localization. The location of the gesture might be reported using different parameters, such as a bounding box, ellipse axes, or the center of mass.
Gesture tracking: the process of gesture localization in a video sequence is known as gesture tracking. Gesture tracking might be performed by localizing the gesture independently in each frame of an image sequence; alternatively, following the motion of the localized gesture across consecutive frames can also be considered gesture tracking.
Gesture pose estimation: estimating the position and orientation of the detected gesture with respect to the camera origin is pose estimation. In the discussions of this thesis, 3D pose refers to position (three parameters) and orientation (three parameters) with respect to the camera coordinate system.
3D gestural interaction: interaction between users and digital devices employing hand/body gestures in 3D space is regarded as 3D gestural interaction. Gesture detection, localization, recognition, tracking, and 3D pose estimation are the essential components of gestural interaction.
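For concreteness, the outputs of these components can be summarized as simple data structures (an illustrative sketch with hypothetical names; the thesis itself does not prescribe any API):

    from dataclasses import dataclass, field
    from typing import List, Optional

    @dataclass
    class Detection:
        present: bool                # gesture detection: is the pattern in the frame?
        label: Optional[str] = None  # gesture recognition: e.g., "grab", "point"

    @dataclass
    class Localization:
        x: float                     # gesture localization: bounding-box center
        y: float                     # (or center of mass) in image coordinates
        width: float = 0.0
        height: float = 0.0

    @dataclass
    class Pose6D:
        tx: float                    # position w.r.t. the camera origin
        ty: float
        tz: float
        roll: float                  # orientation w.r.t. the camera frame
        pitch: float
        yaw: float

    @dataclass
    class TrackedGesture:
        trajectory: List[Localization] = field(default_factory=list)  # one localization per frame
        pose: Optional[Pose6D] = None                                 # 3D pose, when available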
2.2 Related Work
3D technology and the research area around it have developed rapidly during recent years. Although substantial efforts have been made to equip devices with 3D technology, numerous problems and challenges remain. Nowadays, the main focus of the 3D research world is on 3D visualization. For instance, in 3D cinemas, 3D TVs, 3D digital cameras, and even 3D mobile phones, the main goal is to add 3D features to the visualization part. In this context, 3D technology is considered from two aspects: first, how the interaction between user and device happens in 3D space, and second, the visualization technology, where digital output should be displayed in 3D format.
In the following sections, current motion capture and tracking technologies in interactive products and environments are reviewed first. Afterwards, the applicability of current technologies to mobile applications, the available solutions, and related work are discussed.
2.2.1 3D Motion Capture Technologies in Available Interactive Systems
During recent years, 3D motion analysis has been found useful in various scenarios such as entertainment, virtual/augmented reality, and medical applications [11]. Different methods have been studied and introduced to retrieve the motion parameters effectively. Since the aim is to interact with the mobile device through the user's gestures in 3D space, different techniques in 3D motion tracking and analysis must be studied. If we successfully retrieve the 3D motion parameters with high accuracy, we will be able to design an effective interaction environment for the manipulation of digital content. In the following sections, different approaches for analyzing 3D motion, captured by various sensors in different setups, are introduced and compared.
2.2.1.1 Passive Motion Tracking and Its Applications
The most common method of analyzing motion is to use static cameras. The tracking method is known as passive when the cameras are static and the subjects move. The Microsoft Kinect is one example of passive motion analysis, where the sensor is mounted somewhere in the room and users move in front of it. The Kinect features an RGB camera and a depth sensor that provide full-body 3D motion capture and facial recognition [12]. Sony has also added motion capture to its game console [13]. The PlayStation Move performs motion capture with a hand-held controller featuring a spherical glowing part that can shine in the full range of RGB colors. Based on the size and position of the glowing sphere captured by the PlayStation camera, the 3D motion is accurately estimated [14, 15, 16]. The passive approach is widely used in medical applications to analyze patients' motions when diagnosing different types of physical disorders. Such systems usually use several expensive cameras mounted at different positions in the room, and wearable markers or special clothes with visible markers on the body joints, to be detected by the cameras from a distance [17]. In systems that work without any markers or wearable devices, additional sensors such as 3D cameras or depth/distance sensors are usually added to the installation or capture device [17].
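The size-based depth estimation behind such sphere tracking can be illustrated with the pinhole camera model (a sketch under ideal pinhole assumptions, not Sony's actual implementation; all numbers are made up):

    def depth_from_sphere(focal_px: float, radius_m: float, radius_px: float) -> float:
        """Pinhole-model depth of a sphere of known physical size.

        A sphere of radius R at depth Z projects to roughly r = f * R / Z
        pixels, so Z = f * R / r.
        """
        return focal_px * radius_m / radius_px

    # e.g., f = 600 px and a 2.25 cm sphere observed with a 15 px radius:
    z = depth_from_sphere(600.0, 0.0225, 15.0)  # -> 0.9 m from the camera

Combining this depth with the sphere's image position then yields the full 3D position of the controller.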
2.2.1.2 Active Motion Tracking and Its Applications
Active motion capture configurations estimate 3D motion by using wearable devices. Wearable devices might be different types of sensors for measuring 3D motion parameters such as orientation and acceleration changes; they transmit the information to a base station for processing. The Wii MotionPlus game controller performs motion analysis in an active configuration: the device incorporates a gyroscope and an accelerometer to capture and report the 3D motion accurately [18]. In high-accuracy virtual reality and medical applications, similar types of sensors are extensively used on body joints [17].
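A minimal illustration of how such an active configuration turns raw sensor readings into motion is plain Euler integration of the gyroscope rates (a sketch, not the actual sensor-fusion algorithm of the Wii MotionPlus):

    def integrate_gyro(angles_deg, rates_dps, dt_s):
        """Dead-reckon orientation by integrating angular rates.

        angles_deg: current (roll, pitch, yaw) estimate in degrees.
        rates_dps:  gyroscope reading in degrees per second.
        dt_s:       time elapsed since the previous sample.
        """
        return tuple(a + r * dt_s for a, r in zip(angles_deg, rates_dps))

    # e.g., a 100 Hz gyro stream advances the estimate in 0.01 s steps:
    angles = (0.0, 0.0, 0.0)
    angles = integrate_gyro(angles, (0.0, 5.0, 0.0), 0.01)  # pitch grows by 0.05 deg

In practice the integrated gyroscope estimate drifts over time, which is exactly why the accelerometer is combined with it.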
2.2.1.3 Comparison Between Active and Passive Methods
Active motion analysis usually provides more accurate results due to its higher resolution in measuring the motion parameters. Since the sensors are mounted on body parts, they can measure body motions with a higher resolution than a passive installation, where motion is captured from a distance. Based on the research conducted in [19], the accuracy of active motion capture is about 10 times higher than that of passive motion capture with a single RGB camera used for capturing and measuring the 3D motion. Although the measurement depends heavily on the motion analysis technique, in general this substantial difference reveals that, for accuracy reasons, body-mounted sensors are preferred over the passive configuration. A major drawback of active motion capture is that wearable devices are usually uncomfortable for users. Moreover, many active motion capture systems use specific installations, expensive materials, and many sensors, which substantially increase the total cost. On the other hand, passive systems suffer in accuracy due to the error caused by distant motion estimation. Apart from passive systems with wearable markers, marker-less systems use natural gesture analysis and are convenient for users.
Figure 2.1: Passive and active motion tracking in gaming consoles. Left: the Microsoft Kinect; Middle: PlayStation 3 Move; Right: Wii MotionPlus.
Since mobile devices are equipped with different types of sensors, it is possible to use them in either active or passive configurations. On the one hand, when they are in motion, they can be used as active sensors (for instance, by reading the orientation sensor or analyzing the video input). On the other hand, in a static setup they might be considered passive sensors; motion analysis of a moving object in the video input captured by the device's camera is an example of this configuration. This thesis focuses on gestural interaction behind the mobile device's camera. Basically, this scenario is similar to passive motion tracking using a vision sensor, but due to the close distance between the vision sensor and the user's gesture, it also exhibits features of active motion tracking.
2.2.2 3D Motion Estimation for Mobile Interaction
Currently, the most popular way to interact with mobile devices is through 2D touchscreens. As mentioned before, touchscreen displays have limitations in 3D application scenarios, and the idea of employing other types of sensors, such as orientation or vision sensors, provides an opportunity to enhance the quality of interaction. Generally, in HCI applications, different solutions have been used to analyze human body or gesture motion. The information retrieved from motion analysis can be used to facilitate the interaction. Many solutions are based on marked gloves or markers on body joints (see Fig. 2.2) [20, 21, 22, 23, 24, 25, 26]. Some perform gesture analysis using depth sensors [27, 28, 29]. Model-based approaches have been used in many applications [30, 31]. Other solutions analyze the motion by means of shape or temperature sensors [32, 33, 34], etc. Almost all of these solutions were developed for stationary systems with powerful components; due to the limitations of mobile devices, most of them are not practical there. Limited power resources, cost, mobility, and size are important constraints that make the design process for 3D interaction really difficult. New devices are equipped with different types of integrated sensors (orientation sensors, optical cameras, GPS, etc.). The question now is whether it is possible to use them in an effective way to analyze 3D motion. Generally, the answer is yes. In many virtual reality and augmented reality applications, integrated sensors are used to control the motion. In [35], rendered graphics are controlled by the orientation sensor. In [36, 37], vision sensors are employed to detect hands, gestures, or different types of objects. The major weakness of all current technologies is their limitation in 3D motion analysis: most are restricted to object detection algorithms for augmenting graphics or manipulating virtual objects. The problem to be tackled is to analyze the six-DOF motion in 3D space. Therefore, when real 3D interaction with mobile devices is discussed, it means that all motion parameters in 3D space must be considered.
Figure 2.2: Motion-based interaction using wearable markers and gloves. Left: visual markers; Middle: T(ether), motion tracking glove; Right: ShapeHand, motion capture device.
2.2.3 3D Gesture Recognition and Tracking
Existing algorithms for hand tracking and gesture recognition can be grouped into two categories: appearance-based approaches and 3D hand model-based approaches. Appearance-based approaches rely on a direct comparison of hand gestures with 2D image features. Popular image features used to detect human hands and recognize gestures include hand colors and shapes, local hand features, optical flow, and so on. The earlier works on hand tracking belong to this type of approach [38, 39]. The drawback of these feature-based approaches is that clean image segmentation is generally required in order to extract the hand features, which is not a trivial task when, for instance, the background is cluttered. Furthermore, human hands are highly articulated. It is often difficult to find local hand features due to self-occlusion, and heuristics are needed to handle the large variety of hand gestures. Instead of employing 2D image features to represent the hand directly, 3D hand model-based approaches use a 3D kinematic hand model to render hand poses. An analysis-by-synthesis (ABS) strategy is employed to recover the hand motion parameters by aligning the appearance projected by the 3D hand model with the observed image from the camera and minimizing the discrepancy between them.
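As a concrete illustration of the appearance-based side, the following Python/OpenCV sketch shows a typical first step, skin-color segmentation in the YCrCb color space. It is a minimal sketch under assumed threshold values, not a method from the cited works, and its fixed thresholds are exactly what breaks down under the cluttered backgrounds discussed above.

import cv2
import numpy as np

def skin_mask(bgr):
    """Segment candidate hand pixels by skin color, a common
    appearance-based first step. The Cr/Cb bounds are assumed,
    commonly used values; they are sensitive to lighting and clutter."""
    ycrcb = cv2.cvtColor(bgr, cv2.COLOR_BGR2YCrCb)
    lower = np.array([0, 133, 77], dtype=np.uint8)
    upper = np.array([255, 173, 127], dtype=np.uint8)
    mask = cv2.inRange(ycrcb, lower, upper)
    # Morphological opening and closing remove speckles and fill holes.
    kernel = cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5))
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN, kernel)
    mask = cv2.morphologyEx(mask, cv2.MORPH_CLOSE, kernel)
    return mask

The largest connected component of such a mask is then typically taken as the hand region before further feature extraction.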
Generally, it is easier to achieve real-time performance with appearance-based
approaches because the underlying 2D image features are simpler. However, this type of approach can only handle simple hand gestures, such as detection and tracking of fingertips. In contrast, 3D hand model-based approaches offer a rich description that potentially allows a wide class of hand gestures. The difficulty is that the 3D hand model is a complex articulated deformable object with 27 DOF. To cover all the characteristic hand images under different views, a very large image database is required. Matching the query images from the video input against all hand images in the database is time-consuming and computationally expensive. This is why most existing 3D hand model-based approaches focus on real-time tracking of global hand motions under restricted lighting and background conditions.
For general mobile applications, the full range of hand gestures must be covered, so 3D hand model-based approaches seem more promising. To handle the challenging problem of exhaustive search in the high-dimensional space of human hand poses, efficient indexing technologies from the information retrieval field have been tested. Zhou et al. proposed an approach that integrates powerful text retrieval tools with computer vision techniques in order to improve the efficiency of hand image retrieval [40]. An Okapi-Chamfer matching algorithm based on the inverted index technique is used in their work. Athitsos et al. proposed a method that can generate a ranked list of three-dimensional hand configurations that best match an input image [41]. Hand pose estimation is achieved by searching for the closest matches to an input hand image in a large database of synthetic hand images. The novelty of their system is its ability to handle the presence of clutter. Imai et al. proposed a 2D appearance-based method that uses hand contours to estimate 3D hand posture [42]. In their method, the variations of possible hand contours around the registered typical appearances are trained from a number of graphical images generated from a 3D hand model. A low-dimensional embedded manifold is created to overcome the high computational cost of the large number of appearance variations.
Although the methods based on text retrieval are very promising, they are still too few to be visible in the field. The reason might be that the approach is still immature, or that the results are not yet impressive because tests have only been run on databases of very limited size. It might also be a consequence of the suc-
cess of Kinect in real-time human body gesture recognition and tracking. The statistical approaches adopted in Kinect (random forests, for example) have started to dominate mainstream gesture recognition. This effect has been reinforced by the introduction of a new type of depth sensor by the Leap Motion company. This sensor can run at interactive rates (processing at least 10 frames per second) on consumer hardware and interact with moving objects in real-time. Despite its impressive demos, the Leap Motion sensor cannot handle the full range of human hand shapes and sizes. The main reason is that such sensors usually detect and track only the presence of fingertips or points in free space when the user's hands enter the sensor's field of view. In effect, they can only be used for general hand motion tracking.
Given the special requirements of mobile applications, such as real-time processing, low complexity, and robustness, a promising approach to the problem of hand tracking and hand gesture recognition is to use text retrieval technologies for search. In order to apply this technology to the next generation of mobile devices, a systematic study is needed of how text retrieval tools should be applied to gesture recognition and, particularly, how to integrate advanced image search technologies [43]. Surely, many powerful tools exist to overcome this problem. The key issue is to relate vision-based gesture analysis to the large-scale search framework and define the right problem. Once the right problem is defined, we can identify and integrate the right tools to form a powerful solution.
2.2.4 3D Visualization on Mobile Devices
3D visualization or 3D imaging refers to techniques for conveying the illusion of depth to the viewer's eyes. The first efforts on 3D imaging started around the mid-1800s [44, 45]. Most 3D vision systems are based on stereoscopic vision. Stereoscopic vision, or stereopsis, is the process of conveying the 3D illusion by means of stereo images. Various techniques have been introduced in this
way.
Figure 2.3: Examples of available 3D mobile devices. Left: HTC EVO 3D; Right: LG Optimus 3D.
Old techniques such as parallel or crossed-eye viewing [46] without eye-
glasses, and color anaglyphs and color codes using eye-glasses, are widely used for 3D photography [47, 48]. During recent years, 3D technology has become popular in the cinema industry and in TV production. Passive technology using polarized glasses [49] and active shutter glasses [50] are common approaches in 3D cinemas and 3D TVs, respectively. Autostereoscopic 3D is another technology, one that requires no glasses. In this method, stereo images are transmitted separately to each eye from the light source. Some advanced 3D displays also provide a limited number of views of a scene for more realistic 3D perception while the user is moving his/her head [51]. The popularity of 3D displays increased after 3D cinemas and 3D TVs attracted public attention. The current trend in the manufacturing of 3D devices shows that we should expect tremendous growth in this market over the coming few years. Mobile device manufacturers have started releasing smartphones with 3D capabilities, and a few 3D mobile phones are available on the market (see Fig. 2.3). Among different companies, two famous mobile phone manufacturers, LG and HTC, have introduced 3D smartphones. Both devices use autostereoscopic technology and dual cameras for recording and displaying stereoscopic images and videos [52, 53].
Chapter 3
General Concept and Methodology
3.1 General Concept
Intuitive interaction between multimedia users and digital devices is a desired feature of future technology. Although the introduction of breakthrough technologies such as the iPhone dramatically changed the manner in which users interact with their phones, users always demand more realistic ways to communicate with their devices.
Currently, the limitations of 2D touchscreens prevent users from having a natural interaction in a wide range of applications where using physical gestures is unavoidable. For instance, 3D manipulation of graphical content, such as 3D rotations, zooming in/out, grabbing, pushing, and moving, requires a physical 3D space for intuitive interaction. Even the latest smartphones and tablets are limited to touch capabilities for interaction on a restricted 2D surface. Moreover, occlusion caused by fingers and hands might degrade the quality of interaction and visualization. Furthermore, various entertainment and gaming applications are limited to using rendered virtual buttons on the touchscreen although they need real 3D manipulation in free space. For instance, playing musical instruments such as a virtual piano, guitar, or drums requires pushing, moving, and tapping in 3D space. Obviously, limiting the
same tasks to a 2D surface degrades the natural interaction. Rendered virtual controllers on a 2D surface limit the visualization space and affect the quality of the user experience. 3D gestural interaction might be even more vital for forthcoming augmented reality glasses, since the dedicated surface for interaction is shrunk and shifted to the frame.
This thesis aims to introduce a new space for interaction between the user and the mobile device. The main idea is to shift the interaction space from the 2D surface to the real 3D space around the device, where the vision sensor can capture the hand/body gestures. In other words, performing gestural interaction in 3D space is proposed to overcome the limitations of 2D interaction technology.
Delivering the experience of free-hand interaction to mobile device users could profoundly affect the mobile industry. Bare-hand 3D interaction enables users to communicate with their mobile devices in the same manner as they do with people and objects in the physical world (the same way they push a button or pick up and rotate an object in physical space). Analysis of the user's gestures from the video input might be used to control the ongoing operation on the device. Users might perform a set of actions using different hand gestures, or they might control and manipulate virtual objects and buttons with their hand movements.
From the design point of view, the main goal is to enhance the quality of user experience by designing a new interactive environment. Since the quality of user experience is highly affected by the interaction design, introducing a new interaction space aims to solve the current limitations of interaction technology.
From the technical point of view, the aim is to introduce enabling technologies for gesture recognition, tracking, and 3D motion analysis that can effectively facilitate interaction design in mobile devices and multimedia applications. In addition, novel techniques for 3D visualization of multimedia collections, such as single images and photo albums, are considered. Finally, other related implemented techniques that support the main contributions are included in this study.
Figure 3.1: For a better experience design, the interaction and visualization spaces can be extended to 3D space.
From the practical point of view, this work aims to demonstrate the conducted experiments and implementations of the main contributions in real applications. Different application scenarios, such as photo browsing, graphical manipulation, and 3D motion control, are included in this work.
3.1.1 Interaction/Visualization Space
Miniature keyboards, tiny joysticks, and, in particular, touchscreen surfaces are among the various input facilities designed for mobile devices. Considering today's mobile devices, it is clearly observable that the interaction and visualization spaces are located on one side of the device. Since physical buttons have gradually been removed from mobile devices, the surface allocated to the display
has been significantly increased. The major concern with today's mobile devices is the overlap between the interaction surface and the display, due to the shared surface designed for the input and output modules.
Figure 3.2: Bare-hand 3D interaction with a mobile device in the extended interaction/visualization space.
Since users prefer to keep visual contact with the input facilities, it makes sense to design them in this way. For instance, users want to see which button they push or where they touch. On the other hand, this configuration might cause some problems in visualization. Obviously, when we work on touchscreen displays, the occlusion problem might occur: users lose visibility, and the quality of experience is degraded. A novel solution to this problem is to extend the interaction and visualization spaces from the 2D surface to 3D space (see Fig. 3.1). This extension should be done in a way that preserves the user's visual contact with the ongoing operation in interaction with the device. Since mobile devices have at least one embedded camera on the back side, it is possible to see the space behind the device through that vision
sensor. If we manage to interact with the device in the 3D space behind the camera, we can successfully extend the interaction space. Furthermore, interaction in 3D space offers substantial capabilities that can facilitate a wide range of applications. For effective interaction in 3D space, advanced technologies in 3D motion analysis should be developed. From the technical perspective, interaction in 3D space can overcome the limitations of 2D interaction on touchscreen displays. On the other hand, 3D visualization technology, such as 3D coding and 3D displays, might help extend the visualization space from the 2D surface to 3D. 3D visualization technology conveys the illusion of depth and 3D perception to users (see Fig. 3.2).
3.1.2 Sharing the Interaction/Visualization Space
One of the great advantages that 3D interaction offers is the possibility of sharing the physical space for collaboration. In fact, by turning the interaction space from a limited surface into free space, users will be able to collaborate within the common physical space between their mobile devices. Therefore, the concept of single-user, single-device might be extended to collaborative multi-user, multi-device scenarios using the shared space. In general, different configurations for interaction between users and mobile devices might be considered. Depending on the desired purpose, the number of users, the number of devices, and the interaction/visualization spaces might vary. Here, several possible scenarios for single and shared interactive applications are introduced (see Fig. 3.3, 3.4).
3.1.2.1 Single-user, Single-device
In this scenario, the user holds the mobile device in one hand, and the other hand controls the interaction in the 3D space behind the device. The 3D space between the display and the user's eyes is allocated to 3D visualization. The user can control and manipulate the content within the allocated spaces for interaction.
3.1.2.2 Multi-user, Multi-device with Shared Interaction Space
In this configuration, more than one user shares a common interaction space to manipulate the content. They might sit in front of or next to each other. Each holds a device in one hand and interacts with the content using the other hand. Interaction happens in the common space between the devices, where users might share the content, pass it around, or manipulate it together.
3.1.2.3 Multi-user, Single-device with Shared Visualization Space
In this setup, users share a single device for collaboration. They use the space behind the device for 3D interaction, and visualization happens on a single display. This configuration is suitable for the collaboration of two users.
3.1.2.4 Interaction from Different Locations for Multi-user Multi-device
In this configuration, each user has his/her own location, device, and space for interaction with the digital content, but all of them share the same virtual space. In other words, they interact with common content from different locations through the network connection. This model might be extended to the case where the visualization space of one user is affected by the interaction of the other users and vice versa.
3.2 Evolution of Interaction/Visualization Spaces
Mobile phones have evolved substantially since the 1980s, both in design and in functionality. Before smartphones came to the market, the challenge was to reduce the size for portability purposes. After smartphones attracted users' attention, devices became larger in display size with fewer physical buttons, for a better quality of experience. During this evolution, many devices have hit the market with their special features. Generally, if we consider the evolution of mobile devices during the recent decade, we can distinguish a
gradual change from both the interaction and the visualization aspects.
Figure 3.3: Different configurations of single and collaborative interaction. 1: Single-user, single-device; 2: Multi-user, multi-device with shared interaction space; 3: Multi-user, single-device with shared visualization space; 4: Interaction from different locations for multi-user, multi-device.
In the earlier generation of mobile phones, the device's surface was allocated to both inter-
action and visualization facilities (keypad and display). In that configuration, visualization quality was quite weak due to the very limited display area. Afterwards, some manufacturers proposed an innovative solution: allocating the whole surface to the display and placing a physical keypad layer under the display layer. During recent years, most smartphone manufacturers have introduced products with touchscreen displays. In this configuration, both the interaction and visualization spaces are located on the same area.
With the introduction of wearable displays such as Google Glass, the way users interact with the device might totally change due to the removal of the hand-held module. This significant change enables users to benefit from 3D interaction using both hands. It means that if the technical solution for bare-hand interaction is provided, users may perform different actions in 3D instead of using weaker input facilities such as touch frames or voice commands.
In this thesis, the proposed solutions to the limitations of today's technology extend the interaction to the physical 3D space.
Figure 3.4: In 3D interaction, users might share the 3D space for collaborative tasks in different applications.
Interaction happens in the 3D space behind the mobile device, and 3D visualization shows its effect
in the 3D space between the user and the display. If we take into account other body parts, such as the head or feet, as interaction facilities, the interaction space can be extended to the 3D space around the device. This conceptual model might be considered for future interactive smart devices (see Fig. 3.5). However, the evolution trend in mobile interaction is towards designing simpler and more intuitive input facilities. Clearly, advanced media technologies combined with powerful hardware are required to make this evolution happen.
Figure 3.5: Evolution of the interaction/visualization spaces in mobile devices.
3.3 Enabling Media Technologies
As discussed before, the main objective of this thesis is to provide technical solutions that enable users to experience realistic interaction with future smart devices in entertainment, communication, and information contexts. In order to support the main concept, this thesis focuses on two major problems: first, interaction design with mobile devices based on motion analysis in 3D space [38, 39, 54, 55, 56, 57], and second, 3D visualization on ordinary 2D displays [58, 59, 60, 61]. The proposed interactive systems are based on the detection, tracking, and analysis of the user's 3D motion from the visual input. This visual input might be received from the mobile device's camera, a body-mounted camera, a webcam, and, in general, any type of vision sensor.
In this thesis, the main focus is on real-time analysis of the user's gestures captured by the vision sensor. Specifically, hand gestures are considered, due to the direct connection of the hands to real-life gestural activities.
Figure 3.6: Enabling media technologies support the concept of 3D interaction and 3D visualization.
The 3D motion parameters retrieved from the detected gestures are used to drive the real-time interaction in various applications.
Other technical contributions of this thesis focus on providing realistic visualization. They mainly include technical solutions for converting 2D content to 3D and for interactive visualization of the content based on the user's head motion (see Fig. 3.6).
3.3.1 Vision-based Motion Tracking in 3D Space
In this thesis, 3D gestural interaction and its significant advantages in comparison with current 2D technology are introduced. As technically discussed in paper I [57], the efficiency and effectiveness of interaction with mobile devices are substantially higher in 3D space than in 2D space. This interaction might happen by detecting and tracking specific hand gestures that play important roles in interactive applications. Alternatively, other body parts, such as the head or feet, might be employed to perform the 3D interaction. However, the main idea is to use physical space for 3D manipulation on 2D devices. With six-DOF motion analysis, the problems of 2D interaction will be handled in most cases [38, 39, 54, 55]. The important features of the proposed systems are bare-hand, marker-less gesture detection, recognition, and tracking from 2D video input. This technology enables users to efficiently interact with their devices in real-time applications (see Fig. 3.7). Since vision-based interaction with hand-held mobile devices happens in the space close to the camera, the distance between the moving subject and the capturing sensor has some phys-
ical limitations.
Figure 3.7: 3D gestural interaction with a mobile device.
For instance, the user cannot move his/her hand more than 35-40
cm away from the body. This limited interaction space preserves high-resolution motion analysis: when the vision sensor and the moving subject are relatively close to each other, the configuration is more accurate for measuring the 3D motion parameters, and the resolution of the motion analysis increases.
Another proposed configuration for 3D spatial interaction is interactive vision. This setup is proposed for interactive 3D displays, where the user interacts with the content of the display through head motion. Head-mounted or static vision sensors might be used to measure and report the head movements. With the measured 3D motion parameters, users control the angle and viewpoint of the digital content in real-time.
3.3.2 3D Visualization
The amount of multimedia content on digital devices has increased enormously during the recent years. Due to the substantial improvement in the quality of cameras in smart devices, users can capture large amounts of photos and videos with their smartphones. Besides the challenges in ordering, organizing, and interacting with huge collections, visualization quality is another issue that should be taken into account. Since the majority of devices capture and store visual content with 2D technology, 3D visualization of 2D content becomes a challenging task. While today's 3D technology has attracted users' attention, it is quite important to find an effective way to visualize the content in a more realistic fashion. Fig. 3.8 shows the system overview for 3D processing and visualization of 2D content on mobile devices.
Figure 3.8: 3D processing on 2D query images.
This thesis aims to tackle several challenges in visualization and improve the
quality of visual perception in interaction with the multimedia content. Specif-
ically, the following items are considered in the discussions:
• First, is there any way to convey the experience of real 3D visualization (similar to what people experience when watching a real-world scene) to users by measuring the users' dynamic position/orientation in real-time (paper IV [61])?
• Second, is it possible to display stereoscopic 3D content on normal 2D displays (papers I, II, IV, V, VI [56, 57, 58, 59, 62])?
• Third, how can we recover the 3D information from a single 2D image and visualize it in 3D format on an ordinary 2D display (paper V [59])?
• Fourth, is it possible to make use of photo/video collections, captured by 2D devices, to convert and display the content in 3D (paper VI [58])?
• Fifth, is there any efficient way to correct the 3D modeling, positioning, and localization errors by integrating the metadata from position/orientation sensors with computer vision techniques (paper VII [60])?
Figure 3.9: Active 3D vision for head motion-based user-device interaction.
The main focus of all 3D display technologies is to convey the illusion of depth to the viewer's eyes, although 3D visualization might be seen from other perspectives as well. In fact, real 3D perception, in the way we observe the real world, is 3D manipulation based on motion plus depth perception. An example might clarify this idea. Imagine a box in front of a user. Each
side has a different color and pattern, and from the front the user can only see the top and front sides. In a natural manner, if the user wants to see the left or right side, he/she moves to the left or right, and in the same way users observe any scene by moving in different directions. In another approach, the user can pick up the box and rotate it to see any side desired. One way to observe the 3D space is thus to manipulate the scene by the user's motion; in other words, users should be able to control what they want to see. This type of visualization might happen by analyzing the user's motion in front of the vision sensor and transmitting the motion information to the rendering system for 3D visualization. Of course, this process should be performed in real-time, without any noticeable delay, to deliver a realistic experience to the user's eyes. Moreover, the output might be rendered by stereoscopic techniques to convey the illusion of depth. The concept of interactive vision, or interactive 3D displays, can be formed based on this idea (see Fig. 3.9).
3.4 Methodology Overview
The technical contributions of this thesis are mainly focused on the development of enabling technologies for 3D gestural interaction. Therefore, 3D gesture detection, recognition, and tracking are the technical features extensively used in the proposed solutions. Generally, gesture analysis is considered a classical computer vision and pattern recognition problem. Thus, a substantial part of the technical discussion of this work is allocated to these challenges from the classical approach. Low-level feature/pattern detection, global model-based detection, motion estimation from tracking robust features, and other computer vision methods are employed to find novel solutions to the challenges of gesture analysis.
In addition to the common computer vision methods, a new framework for gesture analysis is introduced. Since the capability of modern computers to store and process extremely large databases has increased substantially, shifting the complexity of the methods from pattern recognition algorithms to a large-scale retrieval approach might be the new trend for tackling
the gesture analysis problems. Therefore, the introduced method is based on collecting an extremely large database of gesture images and retrieving the best match from the provided data. In the ideal scenario for gesture analysis, the database should include all possible articulated hand gestures and the corresponding metadata, including the relative spatial position and orientation with respect to the camera. The core methodology is based on direct retrieval of the best match for any query gesture. The retrieval process should be performed in a way that preserves smooth motion in a continuous gestural interaction. This step might be done by analysis of the gesture patterns in high-dimensional space.
The main methodology for improving the visual perception is based on 3D visualization of today's multimedia content on current display devices. The whole process might be divided into two steps. In the first step, 3D motion analysis of the user's head is performed for real-time manipulation of the content. In this step, a vision sensor is used to track visual features from the environment, and motion analysis over consecutive frames is employed to measure the 3D motion parameters. The parameters, measured in real-time, help users interact with the content and manipulate it in a natural manner.
The methodology for visualization of the content is based on processing the images and videos captured by current 2D devices. The conversion methods from 2D to 3D are based on direct analysis of single views or multiple-view analysis of photo collections. The main strategy is to convert the 2D multimedia to 3D and use stereoscopic coding. This approach adds value to the visual experience while not requiring extra hardware. In other words, besides the user-manipulated content, the output might be visualized by stereoscopic techniques to convey the illusion of depth to the user's eyes.
3.5 Gesture Analysis through Pattern Recognition Methods
Basically, a common vision-based system for real-time gestural interaction is composed of four main elements: the user, a vision sensor, a gesture analysis component, and a visualization component. The real-time query input from the user is a continuous set of hand/body gestures. In this context, bare-hand gestural performance in free space is considered for most of the proposed scenarios, and in a few cases head movements are used as query input. Ordinary vision sensors can be divided into two groups: 2D cameras, such as normal RGB webcams, and 3D depth sensors, such as Microsoft Kinect. Since most ordinary devices are equipped with normal RGB cameras, the main focus of this work is on that type of sensor for the different research scenarios (embedding depth sensors in mobile devices does not seem feasible in the near future).
The gesture analysis step usually includes feature extraction, gesture detection, motion analysis, and tracking. Pattern recognition methods for detecting and analyzing hand gestures are mainly based on local or global image features. Simple features such as edges, corners, and lines, and more complex features such as symmetry patterns, SIFT, SURF, and FAST features, are widely used in computer vision applications [63, 64]. If the desired goal is to detect a specific pattern, a combination of image features might be used. For dynamic hand gestures, it is quite challenging to define a single pattern for detection, due to the complex combination of the hand joints. Therefore, a combination of local/global image features might be useful to detect and localize hand gestures. Distinctive features are extremely useful for robust tracking and 3D motion analysis. If the hand gesture is correctly detected and localized, robust features such as SIFT or SURF might be used to analyze the 3D motion parameters over a sequence of image frames. If the main goal is to track the gesture in consecutive frames, the detection algorithm might be conducted on single frames in a sequence. Another way to track the gesture is to detect and localize the gesture in a single frame and follow the detected
pattern in the coming frames using common tracking methods such as optical flow. However, depending on the application scenario, if recognition of different types of gestures is required, different gesture patterns should be analyzed. If the goal is to track a special gesture, the specific pattern might be detected in consecutive frames, and if the 3D motion of the hand gesture is required, gesture localization and 3D motion analysis over the sequence of frames should be performed.
Figure 3.10: Overview of the 3D gesture analysis process based on computer vision methods.
Finally, the gesture analysis output might provide the required information about the type of gesture and the position/orientation of the detected gesture with respect to the vision sensor. The retrieved information is then sent to the real-time applications. The final output might be rendered in 2D/3D for visualization on the display. Fig. 3.10 demonstrates the block diagram of the 3D gesture analysis based on computer vision methods.
3.6 Gesture Analysis through Large-scale Image Retrieval
In addition to the computer vision methods for gesture analysis, this thesis introduces a new framework and methodology for tracking articulated hand motions in video sequences based on search technologies. The innovative solution is to define the problem of hand tracking and gesture recognition as a general image search problem. The idea is to build a large database that contains at least thousands of hand gesture images. Ideally, these images should emulate all possible hand gestures. Furthermore, these images are tagged with hand motion parameters, including the 3D position and orientation of the gestures. When the hand of a mobile device user is captured by the video camera, in the space around the mobile device, the captured hand image is used to retrieve the most similar hand gesture image stored in the database. Then the motion parameters tagged to the matched image are assigned to the captured hand image. Thus, 3D hand tracking and gesture recognition can be achieved. The key to this approach is how to quickly find the best match in the database. The proposed solution is to treat each image as a document, convert its shape features into a huge visual vocabulary table, and employ inverted indexing, a powerful retrieval tool, to perform the search. The developed framework might have a big impact on gesture analysis where high-resolution hand/gesture tracking is required. In fact, unlike the classical pattern recognition methods, in the search framework the entries of the database are not analyzed by shape-based or model-based methods. The main idea is to include every possible hand gesture image, regardless of its shape or model. The entries of the database might be real images of articulated hand gestures or computer-generated graphics. Here, the important point is to annotate the database entries with the position and orientation information of the recorded hand gestures. The vocabulary of hand gestures integrates the information from the visual features of the gesture images and their pose information in an extremely large table.
On the other hand, the query frame, captured by the vision sensor, will be
pre-processed, and its visual features will be extracted for analysis in the gesture search block. The core of the system is the gesture search engine, which analyzes the similarity of the query input to the database entries in several steps and retrieves the best match. The output of the system is the most similar gesture image to the query input, which in the ideal case is identical to the query. Finally, the retrieved image and its annotated pose information will be employed in the application. Fig. 3.11 shows the block diagram of the 3D gesture analysis system based on large-scale image retrieval.
Figure 3.11: Overview of the 3D gesture analysis system based on the large-scale image search method.
Chapter 4
Enabling Media Technologies
From a technical point of view, in order to enhance the usability of an interactive system, numerous challenges must be considered. Specifically, interaction design for mobile devices using hand gestures involves technical issues in computer vision, such as detection, tracking, 3D motion estimation, and visualization. Basically, the technical discussion of the proposed methods can be divided into the following categories: low-level pattern recognition for gesture analysis, search-based gesture analysis, and interactive 3D visualization.
In order to implement a gesture-based interactive system, various hand gestures should be considered. Fig. 4.1 demonstrates the most common hand gestures for 3D interaction and manipulation of objects in different digital environments. Although the collected gestures can be used for different actions, such as pick, place, move, grab, zoom, and rotate, they all might be seen as variations of basic hand poses such as the grab or pinch gestures. This is the main reason that, in this context, gesture detection and recognition based on computer vision methods focus mainly on the grab gesture and its variations, such as deformations, scaling, and rotations. Clearly, these gestures can cover the majority of the required actions in 3D interaction. Moreover, the proposed search-based method for gesture analysis can be used for an extremely large number of hand gestures.
Figure 4.1: Most common hand gestures in 3D interaction scenarios.
4.1 Gesture Detection and Tracking Based on Low-level Pattern Recognition
Low-level pattern recognition algorithms can be extremely useful in gesture analysis. Although low-level features do not represent complex patterns independently, their extremely fast processing and low complexity make them well suited to real-time applications. The main challenge here is how to combine low-level features in an effective way to retrieve a global meaning, such as detecting a gesture pattern in a video sequence.
In the contributions of this thesis towards hand gesture detection and tracking, low-level features are extensively used [38, 39, 54]. Specifically, gesture tracking based on low-level operators known as rotational symmetry patterns is considered. As discussed in paper I [57], rotational symmetries are specific curvature patterns derived from the local orientation image [65]. The main idea behind rotational symmetries is to use local orientation to detect complex curvatures in the double-angle representation. The double-angle representation,
z, of an orientation with direction θ is defined as a complex number whose argument (angle) is double the local orientation, z = c·e^(i2θ), where the magnitude c represents the signal energy or confidence. Rotational symmetries can be categorized by different orders and phases. By applying a set of specific filters to the orientation image, it is possible to detect different members of the rotational symmetry family, such as curvature, circular, and star patterns. The idea of taking advantage of rotational symmetries in gesture detection may seem rather general and complex, but modeling the gesture by the choice of rotational symmetry patterns of different classes makes it possible to differentiate between them and other features, even in cluttered backgrounds. Theories and mathematical definitions of local orientation, rotational symmetries, detection of symmetry patterns, etc., are fully discussed in paper I [57].
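To make this construction concrete, the following Python sketch computes the double-angle image from image gradients and correlates it with an n:th order symmetry kernel. The gradient-based orientation estimate and the kernel form b_n = ((x + iy)/|x + iy|)^n are illustrative assumptions, not the exact filters of paper I [57].

import cv2
import numpy as np

def double_angle(gray):
    """Return the real and imaginary parts of z = c*e^(i*2*theta).
    Squaring the complex gradient (gx + i*gy) doubles its angle."""
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0, ksize=3)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1, ksize=3)
    return gx * gx - gy * gy, 2.0 * gx * gy

def symmetry_response(zr, zi, order=1, ksize=15):
    """Correlate the double-angle image with an assumed n:th order
    rotational symmetry kernel; high magnitudes mark candidate centers
    (order 1: curvature/fingertip patterns, order 2: circular patterns)."""
    r = np.arange(ksize, dtype=np.float64) - ksize // 2
    xx, yy = np.meshgrid(r, r)
    w = (xx + 1j * yy) ** order
    mag = np.abs(w)
    mag[mag == 0] = 1.0
    b = w / mag  # unit-magnitude complex kernel
    # Complex correlation, split into real-valued filterings.
    real = cv2.filter2D(zr, -1, b.real) - cv2.filter2D(zi, -1, b.imag)
    imag = cv2.filter2D(zr, -1, b.imag) + cv2.filter2D(zi, -1, b.real)
    return np.hypot(real, imag)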
Through the experiments, it can be demonstrated that hand gestures and fingertips show high responses when we search for specific group members of rotational symmetry patterns in the orientation image. For example, fingertips respond to the group of first-order rotational symmetries (curvature patterns) [38], and the grab gesture responds to the group of second-order rotational symmetries (circular patterns) [39, 54]. Therefore, depending on the application scenario, the proper detector for different hand gestures can be introduced. With this approach, the hand gesture can be localized in a sequence of frames captured by the device's camera. The selectivity for the desired patterns can be increased by removing the noisy responses caused by complex backgrounds [38].
For instance, if detecting fingertips is desired, first-order symmetry pattern detection will return the positions of the fingertips as well as noisy features from the background. During further processing, the magnitude, phase, and color properties of the responses can be used to differentiate between correct detections and noisy points [66] (see Fig. 4.2).
Figure 4.2: Gesture detection, tracking, and 3D motion analysis based on rotational symmetry patterns.
Second-order rotational symmetries return more specific patterns. For instance, the grab gesture responds to the circular pattern from this group of
symmetries. It is possible to enhance the detection by controlling the phase of the pattern using a simple threshold. Thus, the grab gesture can be localized properly in a video sequence [57].
The mentioned processing results in proper detections and rejects the noisy responses. From a technical perspective, gesture detection and tracking are the first steps in 3D gesture-based interaction. The core of the system is the 3D motion analysis, where the 3D motion parameters are recovered from the video sequence (see Fig. 4.2).
In many interactive applications, 2D gesture detection and tracking are enough to perform the task, and further 3D motion analysis is not required. For real 3D (six-DOF) interaction, extra information about the 3D position and orientation must be recovered.
4.1.1 3D Motion Analysis
In computer vision and image processing, a common way to retrieve and estimate the motion between image frames is to analyze the motion between extracted feature points. The 3D structure can be studied by finding and matching corresponding feature points in consecutive frames [67]. Various types of feature detectors and descriptors have been introduced in computer vision. Generally, feature detectors can be divided into edge,
corner, and blob detectors, or any combination of them [68].
Figure 4.3: 3D motion analysis steps: retrieving the 3D structure from the motion between image frames.
In applications where robustness and accuracy have higher priority, more complex feature descriptors are required. SIFT, SURF, and CHoG [63, 64, 69] are examples of robust feature descriptors that have been found useful in many multimedia applications.
In the contributions of this thesis, the scale-invariant feature transform (SIFT) is widely used as a robust scale/rotation-invariant feature descriptor. Once the hand gesture is localized, SIFT features are extracted in the desired region (the user's gesture) of the image frame. The extracted features are tracked in consecutive frames, and the structure of the 3D motion can be derived by finding the transformation between the two frames. This transformation might be in the form of a planar homography [67], as discussed in [58], or a fundamental and essential matrix [67], as suggested in [39, 54, 60]. In order to remove the outliers among the matched feature points and find the best transformation matrix, consistent with the true matches, robust iterative methods such as RANSAC [70, 71] are applied. As a result, the best motion transformation between the two frames is estimated (see Fig. 4.3). Paper I [57] extensively explains how the 3D motion parameters can be retrieved by decomposing the estimated transformation [39, 54]. In paper I [57], gesture detection, tracking, and 3D motion analysis from rotational symmetry patterns are explained in detail. Moreover, the effect of applying the 3D motion parameters in different applications is demonstrated [57, 62].
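A minimal Python/OpenCV sketch of this pipeline, assuming two 8-bit grayscale frames cropped to the localized gesture region and a known camera intrinsic matrix K, might look as follows; it illustrates the described steps rather than reproducing the thesis implementation.

import cv2
import numpy as np

def relative_motion(frame1, frame2, K):
    """Estimate the relative motion (R, t) between two grayscale frames:
    SIFT matching, RANSAC-based essential matrix estimation, and
    decomposition into rotation and unit-scale translation."""
    sift = cv2.SIFT_create()
    k1, d1 = sift.detectAndCompute(frame1, None)
    k2, d2 = sift.detectAndCompute(frame2, None)

    # Ratio-test matching keeps only distinctive correspondences.
    matches = cv2.BFMatcher().knnMatch(d1, d2, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]
    p1 = np.float32([k1[m.queryIdx].pt for m in good])
    p2 = np.float32([k2[m.trainIdx].pt for m in good])

    # RANSAC rejects outlier matches while estimating the essential matrix.
    E, mask = cv2.findEssentialMat(p1, p2, K, method=cv2.RANSAC,
                                   prob=0.999, threshold=1.0)
    # Decompose the essential matrix into rotation and translation.
    _, R, t, _ = cv2.recoverPose(E, p1, p2, K, mask=mask)
    return R, t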
4.2 Gesture Detection and Tracking Based on Gesture Search Engine
In this thesis, a new framework and algorithms for tracking hands in cluttered images and recognizing the underlying gestures are introduced. To better specify hand gestures and hand motions, two concepts might be distinguished:
Hand Posture: a static hand pose and its current position, without any movement involved.
Hand Gesture: a sequence of hand postures connected by continuous hand or finger movements over a short period of time.
For real-world hand tracking applications, the problems of initialization and recovery have to be addressed. In order to develop robust solutions, we can adopt a static approach, that is, localize and recognize the hand posture in individual frames. Thus, hand gesture recognition can be achieved by reading individual posture images. The goal is that the new framework and algorithms lead to solutions with high tracking and recognition accuracy, so high that the solution can be used as a stand-alone module for 3D hand tracking. Obviously, such solutions will also be useful for providing single-frame estimates to a 3D hand tracker, and consequently for achieving automatic initialization and error recovery.
The proposed technical approach is to redefine the problem of hand tracking and gesture recognition as a text search problem. The framework is based on the idea of building a large database which, in the best case, emulates all possible articulated hand motions. Furthermore, these images are tagged with 3D hand motion parameters, including the joint angles of the articulated fingers. When the hand of a user is captured by the device's camera, the captured hand image is used to retrieve the most similar image from the database. The ground truth labels of the retrieved matches are used as hand pose estimates for the input. This can work even under poor segmentation conditions: the only required input is a bounding box around the hand gesture, and the bounding box is allowed to include arbitrary amounts of clutter in addition to the hand region.
The key issue in this approach is how to quickly find the best match in a database containing gesture images. The proposed solution is based on treating each image as a document, converting shape features into words, and employing a powerful text retrieval tool, inverted indexing, to perform the fast search.
4.2.1 Providing the Database of Gesture Images
The core of the gesture search system is how to represent gesture contours. To enable the formulation of the gestural interaction problem as a search problem, two particular properties should be considered: first, shape sensitivity, which means that the matched hand gesture shape should be as close as possible to the one in the input frame; second, position sensitivity, which means that the matched gesture should be at a similar position as the input gesture. In this work, a new type of shape vocabulary is defined. The introduced technique is based on dividing the contour into segments, or edge features. An individual segment is considered a word for forming the search table.
In order to form the search table, all the database images are normalized and their corresponding edge images computed. Each single edge pixel is represented by its position and orientation. In order to impose a global structure on the low-level edge orientation features, we can form a large table representing all the possible cases in which each edge feature might occur. Considering the whole database with respect to the positions and orientations of the edges, an extremely large table can represent the whole vocabulary of hand gestures in edge-pixel format. For instance, for an image size of 640x480 with 8 orientation bins and a database of 10000 hand gesture images, the gesture vocabulary table will have dimension 2457600x10000. After this huge table is formed, each cell is filled with the indices of all database images that have a feature at that specific point. Therefore, this table collects the required information from the whole database, which is essential for the online gesture search.
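The construction can be sketched in Python with a sparse dictionary standing in for the dense vocabulary table; Canny edges and Sobel orientations are assumed stand-ins, since the exact edge processing is not specified here.

from collections import defaultdict
import cv2
import numpy as np

W, H, N_BINS = 640, 480, 8  # image size and orientation bins from the example

def edge_words(gray):
    """Turn an 8-bit grayscale image into (x, y, orientation-bin) edge words."""
    gray = cv2.resize(gray, (W, H))           # normalize the image size
    edges = cv2.Canny(gray, 100, 200)
    gx = cv2.Sobel(gray, cv2.CV_64F, 1, 0)
    gy = cv2.Sobel(gray, cv2.CV_64F, 0, 1)
    theta = np.arctan2(gy, gx)                # edge orientation in [-pi, pi]
    ys, xs = np.nonzero(edges)
    bins = ((theta[ys, xs] + np.pi) / (2 * np.pi) * N_BINS).astype(int) % N_BINS
    return list(zip(xs, ys, bins))

def build_inverted_index(database_images):
    """Map each (x, y, bin) word to the ids of database images containing it,
    a sparse equivalent of the 2457600x10000 table described above."""
    index = defaultdict(list)
    for img_id, gray in enumerate(database_images):
        for word in edge_words(gray):
            index[word].append(img_id)
    return index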
In addition to processing the database images to form the search table, the 3D motion parameters are calculated for each single gesture image in the database and tagged to that image. This is done by mounting a motion capture sensor on the hand while the database images are being recorded. An active vision sensor (a hand-mounted camera) is used to measure the gesture movements and annotate the gesture images.
4.2.2 Query Processing and Matching
A query hand gesture is any type of hand gesture with its specific position and orientation. The first step in the retrieval and matching process is edge detection. This process is the same as the edge detection in the database processing, but the result will be quite different, because for the query gesture the presence of edge features from the cluttered background and other irrelevant objects is expected.
4.2.3 Scoring System
Assume that each query edge image, Q_i, contains a set of edge points that can be represented by their row-column positions and specific directions. During the first step of the scoring process, for each single query edge pixel, Q_i|(x_u, y_v), a similarity function to the database images at that specific position is computed: Sim(Q_i, D_j). If the matching condition is satisfied for the edge pixel in the query image and the corresponding database images, the first level of scoring starts, and all the database images that have an edge with a similar direction at that specific coordinate receive +3 points in the scoring table. The same process is performed for all the edge pixels in the query image, and the corresponding database images receive their +3 points. Here, an important issue that might arise during the scoring should be considered. The first step of the scoring system covers the case where two edge patterns from the query and database images exactly cover each other, whereas in most real cases two similar patterns are extremely close to each other in position but do not overlap much. For these cases, which
regularly happen, first-level and second-level neighbor scoring are introduced. A very probable case is that two extremely similar patterns do not overlap but fall on neighboring pixels of each other. In order to cover these cases, besides the first-step scoring, for any single pixel, the first-level 8 neighboring and second-level 16 neighboring pixels in the database images are also checked. All the database images that have an edge with a similar direction among the first-level and second-level neighbors receive +2 and +1 points, respectively. In short, the scoring is performed for all the edge pixels in the query with respect to the similarity to the database images, at three levels with different weights. The accumulated score of each database image is calculated and normalized, and the maximum scores are selected as the top matches. The proposed algorithm selects the top ten matches from the database. In order to find the closest match among the top matches, a reverse comparison is required. Reverse scoring means that besides the similarity of the query gesture to the database images, Sim(Q_i, D), the reverse similarity of the selected top database images to the query gesture is computed. The combination of the direct and reverse similarity functions results in much higher accuracy in finding the closest match from the database. The final scoring function is computed as S = [Sim(Q_i, D) × Sim(D, Q_i)]^(1/2). The highest value of this function gives the best match among the database images for the given query gesture. Afterwards, the motion parameters tagged to the best match can be immediately used to facilitate various application scenarios.
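A minimal sketch of this scoring scheme, reusing the inverted index and edge words from the previous sketch, might look as follows; the weights and neighborhood rings follow the description above, while everything else is an illustrative assumption.

import numpy as np

NEIGHBOR_WEIGHTS = {0: 3, 1: 2, 2: 1}  # exact hit, first-level, second-level

def score_query(query_words, index, n_images):
    """Accumulate +3/+2/+1 votes for database images whose edges share the
    query pixel's orientation bin at the same, 8-neighbor, or 16-neighbor
    positions, then normalize the scores."""
    scores = np.zeros(n_images)
    for x, y, b in query_words:
        for dx in range(-2, 3):
            for dy in range(-2, 3):
                ring = max(abs(dx), abs(dy))  # 0, 1, or 2
                for img_id in index.get((x + dx, y + dy, b), []):
                    scores[img_id] += NEIGHBOR_WEIGHTS[ring]
    if scores.max() > 0:
        scores /= scores.max()
    return scores

def select_best(direct_scores, reverse_scores):
    """Combine direct and reverse similarity: S = [Sim(Q,D) x Sim(D,Q)]^(1/2).
    reverse_scores[j] is assumed to score database image j against the query."""
    top10 = np.argsort(direct_scores)[-10:]
    combined = np.sqrt(direct_scores[top10] * reverse_scores[top10])
    return top10[int(np.argmax(combined))]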
An additional consideration in a sequence of gestural interactions is the smoothness of the gesture search. Smoothness means that the retrieved best matches in a sequence should represent a smooth motion. In order to perform a smooth retrieval, the database gesture images should be analyzed in high-dimensional space to detect motion maps. Motion maps indicate which gestures are close to each other and fall in the same neighborhood in high-dimensional space. Therefore, for a query gesture image in a sequence, after the top-ten selection, the reverse similarity is computed and the top four matches are
selected. Afterwards, the algorithm searches the motion paths to check which of these top matches is closest to the previous frame's match, and that image is selected as the final best match.
Figure 4.4: Overview of the gesture search engine.
Fig. 4.4 shows the block diagram
of the gesture search engine. In paper III [56], the whole process is explained
in detail.
4.2.4 Quality of Hand Gesture Database
In general, two main issues must be considered for the database: how large should it be, and how can it be built? The human hand is a complex articulated structure consisting of many connected links and joints. Including 6 DOF for orientation and position, the human hand has 27 DOF in total [72]. Rendering all possible combinations of joints and poses would generate a huge number of hand images, and it is impossible to store all of them on a mobile device. Fortunately, there is a strong correlation between joint angles, and the state space of the joints has substantially lower dimensionality. In [73], Wu et al. show that the state space of the joints can be approximated with 7 DOF. Thus, 7 is a rather good estimate of the embedded dimension
of hand postures. If we quantize each DOF and represent it with 3 bits, we will have a total of 8^7 ≈ 2 million states. Thus, we have a rough estimate of the required size of the hand gesture image database: at least around 2 million images.
Figure 4.5: Interactive 3D vision overview.
The second issue is how to build such a database. One solution is to use a 3D hand model to render all possible hand postures with computer graphics technology and convert the generated gesture images into binary shape images through edge and boundary detection. The major problem with this approach is that the extracted edges are not natural, which directly affects the search for the best-matched hand shape. In this thesis, a bare hand is instead recorded against a uniform background while performing all sorts of gestures, and the recordings are converted into binary hand shape images. Motion sensors or video cameras are attached to the hand to measure the exact position/orientation of the gestures. Thus, the ground-truth hand motion parameters are tagged to the gesture images.
4.3 Interactive 3D Visualization
The main idea behind interactive 3D visualization is to enable users to interact with the content of a display based on their motion in free space. In fact, this technology helps them perceive the content in a realistic manner by controlling the angle and viewpoint in real-time, turning the normal screen into an interactive digital window.
For accurate 3D motion tracking, active technology requires mounting the vision sensor on the user's body. Since the aim is to manipulate the content based on the user's viewpoint, the sensor is mounted on the user's head.
Therefore, the video sequence can be captured in real-time. In order to estimate the head motion parameters from the visual input, the image frames extracted from the video sequence are processed in the motion analysis step. The proposed motion analysis technology is based on the analysis of the 3D head motion between consecutive frames captured by the camera. For every two consecutive image frames, a robust feature detector can be employed to extract and track the important feature points from the environment. In most cases, due to its robustness and scale-invariance properties, the SIFT feature detector is used. Afterwards, the relation between the two sets of corresponding feature points can be represented by a transformation matrix; for instance, a planar homography or, for a more accurate representation, the fundamental and essential matrices are calculated. The transformation matrix contains the information about the motion between the two image planes. In the next step, a decomposition process is applied to the transformation matrix to retrieve the 3D motion parameters. This process is performed on every two consecutive frames, and the relative 3D position/orientation is estimated. The motion analysis block provides six outputs for the rendering block: three representing the orientation parameters and three representing the position parameters in the x, y, z coordinate system.
Note that SIFT feature detection in every single frame requires rather heavy processing, which is not a problem in stationary systems. For faster processing, especially on mobile platforms, faster detectors such as SURF or FAST can be used. Another approach is to perform the feature detection only in the first frame and track the detected features with common tracking methods such as optical flow in the consecutive frames; the feature detection can be repeated whenever the number of tracked features drops below a certain value.

Figure 4.6: Real-time interaction with the graphical content using the interactive 3D vision system.

The rendering block generates and updates the scene based on the provided motion information at each moment. The rendered scene is based on the pre-defined graphical model or augmented reality environment. The output is displayed on a screen while the user can interact with and manipulate the content in real-time. Fig. 4.5 shows the system overview of the interactive 3D vision, and Fig. 4.6 shows how the user controls the viewpoint and position in the rendered scene by moving in 3D space. For capturing and measuring the 3D position and orientation of the user, an ordinary webcam is mounted on the user's head. The graphical content is updated according to the translation and rotation of the head at each moment. The perception effect is similar to looking at a real scene through a window: the view adjusts with the angle and position of the viewer. In paper IV [61], interactive 3D visualization is discussed in detail.
4.4 Methods for 3D Visualization
As mentioned before, in order to enhance the quality of experience in multimedia applications, the aim is to visualize the output in 3D format. Here, the following scenarios might be considered:

• First, 3D visualization of a graphical model with a known geometry [39, 54];
• Second, 3D visualization of single images using the image itself [59];
• Third, 3D visualization of monocular images by analysis of multiple views in 2D digital photo collections [58].
A common way to visualize content in 3D format is to produce stereo views. As fully discussed in [54, 57, 58, 59], stereoscopic systems transmit stereo views of a scene, represented by two viewpoints with a slight horizontal translation between them. Basically, when rendering graphical models in different applications, the geometry of the scene is known; therefore, it is rather simple to render a second view that satisfies the geometry required for stereoscopic viewing. The task of stereoscopic visualization becomes more challenging when the content is not recorded with stereo cameras and no prior knowledge about the geometry or structure of the 3D scene is provided. Considering single views or randomly captured views of a scene, an efficient way to generate stereo views must be found. 3D visualization from single and multiple 2D views is briefly described in the following sections; in papers V and VI [58, 59], the whole process is explained in detail.
4.4.1 Depth Recovery and 3D Visualization from a Single View
Making stereo views from a single monocular image is one of the most challenging tasks in computer vision. The first step in making 3D from single images is to recover the depth map. This is done by applying supervised learning algorithms to a set of images and the corresponding ground truth depth maps; statistical image modeling and estimation techniques such as Markov Random Fields (MRF) are used for training the system [74]. After the training process, the depth map for a query image can be recovered. Once the depth map is estimated, the information required for generating stereo views is calculated as suggested in [59].
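As a rough illustration of this last step, a sketch of depth-image-based view synthesis: each pixel is shifted horizontally by a disparity inversely proportional to its recovered depth. The constant, the 8-bit BGR input assumption and the hole handling are illustrative; see [59] for the actual procedure.

    #include <opencv2/opencv.hpp>

    // Synthesize a right-eye view from an image (8-bit BGR) and its
    // depth map (CV_32F, metric depth per pixel).
    cv::Mat synthesizeRightView(const cv::Mat& img, const cv::Mat& depth,
                                float focalTimesBaseline = 2000.0f) {
        cv::Mat right(img.size(), img.type(), cv::Scalar::all(0));
        for (int y = 0; y < img.rows; ++y)
            for (int x = 0; x < img.cols; ++x) {
                float z = depth.at<float>(y, x);
                int d = z > 0.0f ? cvRound(focalTimesBaseline / z) : 0;
                if (x - d >= 0)  // nearer pixels shift further
                    right.at<cv::Vec3b>(y, x - d) = img.at<cv::Vec3b>(y, x);
            }
        // Disoccluded holes would be inpainted in a complete pipeline.
        return right;
    }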
4.4.2 3D Visualization from Multiple 2D Views
Although 2D digital photo galleries and collections do not contain any 3D information, interesting 3D images and videos can be generated from them using computer vision techniques. Basically, in many photo collections there are a lot of hidden connections between the images, and these connections can be represented by transformation matrices. In fact, any two, three or more unstructured photos of a scene might capture overlapping areas. This means that by finding the geometric transformation between the overlapping images, 3D information about the real scene can be inferred. In paper VI, the process of generating stereo views for 3D visualization by matching feature points and finding the homography transformation between overlapping frames is discussed in detail [58].
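A minimal sketch of this matching step with OpenCV (ORB features are used here to keep the example self-contained; the pipeline in the paper is built around SIFT-style features):

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Estimate the homography between two overlapping photos.
    cv::Mat overlapHomography(const cv::Mat& img1, const cv::Mat& img2) {
        cv::Mat g1, g2;
        cv::cvtColor(img1, g1, cv::COLOR_BGR2GRAY);
        cv::cvtColor(img2, g2, cv::COLOR_BGR2GRAY);
        cv::Ptr<cv::ORB> orb = cv::ORB::create(2000);
        std::vector<cv::KeyPoint> k1, k2;
        cv::Mat d1, d2;
        orb->detectAndCompute(g1, cv::noArray(), k1, d1);
        orb->detectAndCompute(g2, cv::noArray(), k2, d2);
        cv::BFMatcher matcher(cv::NORM_HAMMING, /*crossCheck=*/true);
        std::vector<cv::DMatch> matches;
        matcher.match(d1, d2, matches);
        if (matches.size() < 4) return cv::Mat();  // not enough overlap
        std::vector<cv::Point2f> p1, p2;
        for (const auto& m : matches) {
            p1.push_back(k1[m.queryIdx].pt);
            p2.push_back(k2[m.trainIdx].pt);
        }
        // RANSAC rejects matches inconsistent with a planar transformation;
        // the returned H maps points of img2 into img1.
        return cv::findHomography(p2, p1, cv::RANSAC, 3.0);
    }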
4.5 3D Channel Coding
The final step in visualizing the content in 3D is to encode the stereo channels. The coding techniques vary according to the display technology. For instance, in passive 3D systems with polarized glasses, the stereoscopic output is transmitted in channels with different polarities [75], while with active shutter glasses, stereo frames are transmitted at twice the original rate (60 × 2 = 120 frames/sec) [75]. In the implementations of this thesis, ordinary 2D displays are considered for rendering the 3D output. A common group of stereoscopic techniques that do not require 3D displays is color anaglyphs [76, 77]. In anaglyph methods, the stereo frames are encoded into two different colors for the left and right eyes; the color-coded stereo frames are merged and displayed as a single layer on the display. Depending on the coding method, appropriate low-cost glasses are used to decode the displayed output: the glasses feature two different color filters for the left and right lenses, each filtering out the corresponding layer from the output image. In the implementations, two enhanced techniques for generating more realistic outputs are employed, known as Optimized Anaglyph and Color-code 3D [47, 76].

Figure 4.7: Contributions in 3D visualization.
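As an illustration, a minimal sketch of the plain red-cyan anaglyph coding, the simplest member of this family (the Optimized Anaglyph and Color-code 3D variants weight and mix the channels differently):

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Merge a stereo pair into one red-cyan anaglyph frame: the left view
    // supplies the red channel, the right view the green and blue channels.
    cv::Mat redCyanAnaglyph(const cv::Mat& left, const cv::Mat& right) {
        std::vector<cv::Mat> l, r, out(3);
        cv::split(left, l);   // OpenCV stores channels in BGR order
        cv::split(right, r);
        out[0] = r[0];        // blue  <- right view
        out[1] = r[1];        // green <- right view
        out[2] = l[2];        // red   <- left view
        cv::Mat anaglyph;
        cv::merge(out, anaglyph);
        return anaglyph;      // viewed through red(L)/cyan(R) glasses
    }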
Chapter 5
Experimental Results
5.1 Experiments on Gesture Detection, Tracking and 3D Motion Analysis
Basically, the implemented system for gesture detection and tracking based on low-level patterns includes the gesture input from the user, a vision sensor, and the algorithms for 3D gesture analysis.
The target gesture for the implementation of the gesture detector using low-level patterns is the Grab gesture. The grab gesture was selected based on studies of intuitive human hand gestures in daily tasks such as picking, placing, and manipulating objects [57]. In the experiments, the grab gesture is not treated as a rigid object; the implemented system is designed to tolerate deformations and rotations of the gesture up to the limit where its global shape is still preserved in the captured frames.
5.1.1 Camera and Experiment Conditions

Experiments on gesture detection using rotational symmetry patterns are generally conducted in the lab environment under normal lighting conditions and with different backgrounds. For all the experiments, a single RGB webcam is used, in both static and semi-dynamic (held in one hand) setups, to simulate stationary and mobile configurations. The distance between the camera and the user's gesture is normally between 15 and 40 cm. For testing the robustness of the system, various backgrounds with different colors and patterns are used. In addition, a number of users with different skin colors and hand sizes are included in the tests.

Figure 5.1: Sample variations of the Grab gesture.
5.1.2 Algorithm
In the experiments of this thesis, rotational symmetry patterns are used with two different approaches for detecting the grab gesture. The first approach is based on the first order symmetry patterns. This group represents curvature patterns with different orientations. Observations reveal that fingertips respond strongly to this group of symmetry patterns. On the other hand, curvature patterns are rather general, and noisy points in the background might show a similar response to the first order symmetry detector. Therefore, in order to differentiate between noisy points and fingertips, more features should be integrated into the algorithm. The first criterion is the magnitude of the responses: responses at fingertips are normally much stronger than those at noisy points. Another feature is the phase. Since an intuitive hand gesture will not rotate to arbitrary angles, setting a threshold on the phase limits the responses to natural orientations. The third criterion is skin color; a threshold on the color at the response locations helps to remove further noise. Finally, by combining all these conditions, the best responses, representing the fingertips, are detected. Although this approach requires further processing for detecting the fingertips, it provides more flexibility for detecting deformed gesture patterns. By detecting the fingertips and measuring the distances between them, it is possible to model various hand gestures.

Figure 5.2: 3D model manipulation using second order rotational symmetry patterns. The graphical model follows the exact motion of the user's hand gesture in 3D space.
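A heavily simplified sketch of the first order filtering cascade described above, assuming the per-pixel magnitude and phase of the symmetry response and a binary skin mask have already been computed; all names and thresholds are illustrative, not the thesis implementation:

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Keep only the symmetry responses that pass the magnitude, phase and
    // skin-color criteria; the survivors are fingertip candidates.
    std::vector<cv::Point> fingertipCandidates(
            const cv::Mat& magnitude,  // CV_32F response strength
            const cv::Mat& phase,      // CV_32F response angle (radians)
            const cv::Mat& skinMask,   // CV_8U, non-zero where skin-colored
            float magThresh, float phaseMin, float phaseMax) {
        std::vector<cv::Point> tips;
        for (int y = 0; y < magnitude.rows; ++y)
            for (int x = 0; x < magnitude.cols; ++x) {
                float m = magnitude.at<float>(y, x);
                float p = phase.at<float>(y, x);
                if (m > magThresh &&                 // strong response
                    p > phaseMin && p < phaseMax &&  // natural hand orientation
                    skinMask.at<uchar>(y, x))        // lies on skin
                    tips.emplace_back(x, y);
            }
        return tips;  // clustered/non-max suppressed in a further step
    }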
Another developed method for detecting the grab gesture is based on the second order symmetry patterns. The second order group represents circular patterns. With some constraints on the phase of the detected patterns, the circular form of the grab gesture can be detected properly. The constraints on the phase can be set based on the restriction of the wrist joint, which can only rotate within a limited angle; therefore, the grab gesture can be detected by searching for circular patterns with a phase variation between +/- 45 degrees (see Fig. 5.1). In order to improve the robustness of the system, after the first detections a region of interest is defined around the localized gesture to secure the correct detection in consecutive frames and automatically remove noisy points. Since the second order rotational symmetries represent more complex patterns, the noisy points are significantly fewer than in the previous case; on the other hand, the flexibility of the user's gesture is lower than with fingertip detection. The center of the patterns detected using second order symmetries returns a point around the center of the grab gesture.
In order to retrieve the 3D motion of the localized gesture between the image frames, SIFT feature detection and tracking is performed. For faster processing, the feature points are detected in the first frame and tracked in the consecutive frames to retrieve the 3D motion parameters. In the PC implementation, full SIFT feature matching between all frames has also been tested. In the former case, when the number of tracked features drops below 35 points, the feature detection is restarted to guarantee robust motion analysis.
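A minimal sketch of this detect-then-track loop with OpenCV's pyramidal Lucas-Kanade tracker; goodFeaturesToTrack stands in here for the SIFT detection step, while the re-detection threshold of 35 points follows the text:

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Track features frame-to-frame; re-detect when fewer than 35 survive.
    void trackLoop(cv::VideoCapture& cap) {
        cv::Mat frame, prev, curr;
        std::vector<cv::Point2f> pts;
        cap >> frame;
        cv::cvtColor(frame, prev, cv::COLOR_BGR2GRAY);
        cv::goodFeaturesToTrack(prev, pts, 200, 0.01, 10);
        while (cap.read(frame)) {
            cv::cvtColor(frame, curr, cv::COLOR_BGR2GRAY);
            std::vector<cv::Point2f> next, prevKept, currKept;
            std::vector<uchar> status;
            std::vector<float> err;
            cv::calcOpticalFlowPyrLK(prev, curr, pts, next, status, err);
            for (size_t i = 0; i < status.size(); ++i)
                if (status[i]) {               // keep successfully tracked points
                    prevKept.push_back(pts[i]);
                    currKept.push_back(next[i]);
                }
            // ... feed (prevKept, currKept) correspondences to the 3D
            //     motion analysis (e.g., the Essential-matrix step) ...
            pts = currKept;
            if (pts.size() < 35)               // threshold from the text
                cv::goodFeaturesToTrack(curr, pts, 200, 0.01, 10);
            prev = curr.clone();
        }
    }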
5.1.3 Programming Environment and Results
The first version of the gesture analysis system, based on the rotational symmetries, was developed in Matlab. After the preliminary results were validated, the program was reimplemented in a C/C++ environment, which significantly improved the efficiency of the system in processing the video sequence in real-time.
Since rotational symmetries are low-level patterns, the detection process is, from a computational perspective, extremely fast. This is the major advantage that improves the quality of interaction in real-time applications; even with the further processing for retrieving the 3D motion parameters, the performance required for efficient interaction can be achieved. As reflected in [57, 66], the measured detection accuracy shows the effectiveness of the algorithm. The implemented system based on the first order rotational symmetry detector returns the fingertip positions; in order to localize the grab gesture, the middle position between the detected thumb and index finger is taken as the output.
The implemented system based on the second order symmetry detector returns the center of the circular pattern. Thus, for the grab gesture, the position of the response is always a point close to the gesture center.
In both cases, the detected gesture points are used for the manipulation of graphical objects. The tested scenarios are based on 3D tracking and rotation with six-DOF motion analysis (see Fig. 5.2).
Figure 5.3: Capturing the motion parameters for tagging the pose information to the database images.
5.2 Experiments on Gesture Search Framework
The experiments on the gesture search framework can be divided into two steps. The first is an offline step that includes constructing the database entries, tagging the motion parameters, and forming the vocabulary table of the gestures. The second is the online gesture search process for the query input, which includes the scoring process and the neighborhood analysis for finding the best match for the query.
5.2.1 Constructing the Database
The main strategy behind constructing the database is to record and store all possible hand gestures, including deformation, scaling, and translation variations. Moreover, the stored gesture frames should contain the 3D motion information for instant retrieval after the matching step. For this reason, an active vision system is used to immediately retrieve and tag the 3D motion parameters (six parameters covering the 3D position/orientation) to each image frame during the recording of the database images.

Figure 5.4: Active motion analysis for tagging the orientation information to the database images. The vision sensor is attached to the back side of the hand for measuring the 3D motion parameters. The retrieved motion parameters are applied to the 3D model to validate the accuracy.
The whole database is recorded in the lab environment with stable lighting conditions and a plain green background. In order to easily obtain a clear image of the gesture and eliminate the rest of the image, extra green paper is used to cover the arm and the hand-mounted camera. The active camera is mounted on the back side of the hand, and a second camera captures the video sequence while the user performs different gestures. The active camera thus captures frames of the environment for online 3D motion analysis, while the second camera simultaneously captures the gesture sequence. Finally, the retrieved 3D orientation, based on the hand motion, is tagged to the synchronized frame from the second camera, and this process continues until the construction of the database is complete (see Fig. 5.3 and Fig. 5.4).
The process of generating the database images and retrieving the orientation parameters is conducted in a C++ environment and performed in real-time. Another reason to cover the arm with the background color and provide a clear image of the hand is to calculate the 3D position of the gesture in each database image. During this step, the database images are first converted to edge images. Afterwards, the average position of the edges in the image coordinate system is calculated for each frame. Moreover, the bounding box around each gesture is defined; the size of the bounding box reflects the scaling factor, i.e., the depth of the gesture with respect to the camera position. Finally, these three parameters, representing the 3D gesture position, are retrieved.
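A minimal sketch of this position-tagging step, assuming the arm is already masked out and leaving the mapping from bounding-box size to metric depth to an offline calibration:

    #include <opencv2/opencv.hpp>
    #include <cmath>
    #include <vector>

    // Estimate the 3D gesture position of one database frame: (x, y) from
    // the mean edge location, a scale value from the bounding-box size.
    cv::Vec3f gesturePosition(const cv::Mat& gray) {
        cv::Mat edges;
        cv::Canny(gray, edges, 50, 150);     // edge image of the gesture
        std::vector<cv::Point> pts;
        cv::findNonZero(edges, pts);         // all edge pixels
        if (pts.empty()) return cv::Vec3f(0, 0, 0);
        cv::Point2f mean(0, 0);
        for (const auto& p : pts) { mean.x += p.x; mean.y += p.y; }
        mean *= 1.0f / pts.size();           // average edge position
        cv::Rect box = cv::boundingRect(pts);
        // The box diagonal acts as a scale factor, reflecting the depth of
        // the gesture with respect to the camera.
        float scale = std::sqrt(float(box.width * box.width +
                                      box.height * box.height));
        return cv::Vec3f(mean.x, mean.y, scale);
    }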
At this point, the database of hand gestures is constructed, including the gesture images, the converted gesture edge images, and the corresponding text files containing the six motion parameters.
5.2.2 Forming the Vocabulary Table
The implemented algorithm for finding the best match for the query frames is based on low-level edge orientation features. Thus, the vocabulary table contains the indices of the relevant database images at different locations. The vocabulary table has the number of database images as its row size and m×n×nθ as its column size, where m and n represent the width and height of each image and nθ is the number of angle intervals. For most of the conducted tests, with an image size of 320x240, eight angle intervals, and 6000 images in the database, the vocabulary table has 6000 rows and 320×240×8 = 614400 columns. Each block in the vocabulary table stores the indices of the database images that have an edge at that position with a similar orientation. The conducted experiments reveal that with a database size of around 6000, the maximum number of indices in each block does not exceed 100. The whole process of forming the vocabulary table is performed in Matlab; the final table is stored in text format for the online retrieval step.
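A minimal sketch of this construction, in C++ rather than the Matlab used in the thesis; the per-pixel orientation-bin representation and all names are illustrative assumptions:

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Build the vocabulary table: one cell per (x, y, orientation bin),
    // each cell listing the database images with a matching edge there.
    std::vector<std::vector<int>> buildVocabulary(
            const std::vector<cv::Mat>& edgeImgs,   // CV_8U binary edges
            const std::vector<cv::Mat>& thetaBins,  // CV_8U bin index per pixel
            int m, int n, int nTheta) {             // width, height, angle bins
        std::vector<std::vector<int>> table(m * n * nTheta);
        for (int idx = 0; idx < (int)edgeImgs.size(); ++idx)
            for (int y = 0; y < n; ++y)
                for (int x = 0; x < m; ++x)
                    if (edgeImgs[idx].at<uchar>(y, x)) {
                        int bin = thetaBins[idx].at<uchar>(y, x);
                        table[(y * m + x) * nTheta + bin].push_back(idx);
                    }
        return table;  // 320x240 with 8 bins -> 614400 cells, as in the text
    }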
5.2.3 Gesture Search Engine and Neighborhood Analysis
The online search system is implemented in a C++ environment for efficient real-time interaction. First, the vocabulary table is loaded into memory for fast retrieval. Afterwards, each frame from the real-time video input is sent to the gesture search engine. After the direct and reverse scoring steps, the top four matches are sent to the neighborhood analysis step, and the best match is selected.
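As one plausible reading of the direct scoring step, a sketch in which every edge pixel of the query votes for the database images listed in the corresponding vocabulary cell; the reverse scoring and the neighborhood analysis follow similar patterns and are omitted here:

    #include <opencv2/opencv.hpp>
    #include <vector>

    // Score all database images against one query frame.
    std::vector<int> directScores(
            const std::vector<std::vector<int>>& table,  // vocabulary
            const cv::Mat& queryEdges, const cv::Mat& queryBins,
            int m, int n, int nTheta, int numImages) {
        std::vector<int> score(numImages, 0);
        for (int y = 0; y < n; ++y)
            for (int x = 0; x < m; ++x)
                if (queryEdges.at<uchar>(y, x)) {
                    int bin = queryBins.at<uchar>(y, x);
                    for (int idx : table[(y * m + x) * nTheta + bin])
                        ++score[idx];  // one vote per matching edge pixel
                }
        return score;  // the top entries go on to reverse scoring
    }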
Different methods for the analysis and mapping of the gesture images from the high-dimensional space to 3D space are introduced in paper III [56]. The main idea behind them is to analyze the distances between the gesture patterns and construct a meaningful pattern for the neighborhood search. Since gestural interaction represents a smooth motion in 3D space, neighborhood analysis for selecting or predicting the closest database match for the query inputs is quite important. In the implementations, the Laplacian method is selected for mapping the gesture vectors from the high-dimensional space to 3D space. The Laplacian is selected over other methods such as PCA and LLE because of the clearly visible pattern in the 3D representation of the image vectors. As demonstrated in Fig. 5.5, each branch in the graph indicates a clear change in the positioning of the gesture patterns within the database images. Basically, the dense center mostly represents the gestures around the center point of the image, and each branch shows the direction towards a corner of the image frames. In the process of selecting the best match from the top matches, neighborhood analysis is used to return the closest gesture match based on the previously selected match in the video sequence. This step smooths the motion of the retrieved sequence.
5.2.4 Gesture Search Results
Figure 5.5: Left: Gesture images mapped to the three-dimensional space by the Laplacian method. Right: Gesture and non-gesture images mapped to the 3D space by PCA.

During the experiments, various databases have been prepared for testing the performance of the system. The earliest database contained about 1500 images of the grab gesture. Later, another database with 3000 images, including the grab gesture and other types of hand gestures, was captured. Afterwards, the database was extended to more than 6000 images; at this step, non-gesture images were also included to analyze the performance on a larger database with noisy entries. At each step, system tests have also been conducted on resized images. In general, both 320x240 and 160x120 images show quite promising results in the tests. Most of the tests are based on the 320x240 images, but if the database size increases to more than 10000 entries, the image size might be set to 160x120 to improve the efficiency of the retrieval.
Among the mentioned steps in the online retrieval system, the reverse scoring consumes most of the processing time. This is the major reason that the reverse scoring is conducted only for the top ten matches retrieved by the direct scoring step. For instance, with 6000 images in the database and an image size of 320x240, the retrieval system can process 25 frames/second with the direct scoring step alone, while applying the reverse scoring reduces this to 15 frames/second. Thus, the reverse scoring has a stronger effect on the processing time than the database size. Fig. 5.6 shows sample gesture inputs and the corresponding best matches from the database of hand gestures.
Figure 5.6: Output of the gesture search engine for a number of sample query gestures.
5.3 Technical Comparison between the Prior Art and the Proposed Solutions
Basically, the majority of hand gesture recognition and tracking systems employ vision-based approaches to handle the technical challenges. RGB and depth cameras, or combinations of them (e.g., the Kinect), are the most widely used hardware for capturing body gestures. Since the contributions of this thesis target current and future mobile devices, the introduced methods are based on using ordinary RGB cameras such as webcams and mobile cameras. Although various algorithms have been introduced in the computer vision and pattern recognition literature, the proposed solutions of this thesis can be compared to the common approaches for hand detection, gesture recognition, gesture tracking and 3D motion analysis. 2D and 3D features, 3D models, skeletal, appearance, color, and depth information are among the best-known properties that have been used for gesture detection and tracking; in fact, the majority of the prior art can be grouped into these categories or combinations of them.
As discussed before, the proposed gesture analysis system based on rotational symmetry patterns can be considered a combination of the mentioned computer vision approaches. On the other hand, the introduced gesture analysis system based on large-scale search is not a classical computer vision approach. However, the technical contributions of this thesis can be compared with the prior art from different aspects. Table 5.1 provides a comprehensive comparison between the prior art and the proposed solutions, where Methods 1 and 2 represent the rotational symmetries and the gesture search method, respectively. The ratings are estimated based on reviews and surveys of the current vision-based technologies [78].
5.4 3D Rendering and Graphical Interface
3D rendering is the process of generating a graphical view based on three-dimensional models. 3D models might contain various properties such as geometry and texture. The 3D rendering process depicts the 3D scene as a picture taken from a particular perspective angle, which can be changed based on the desired viewpoint in a continuous sequence. Various features such as lighting, shadows, atmosphere, refraction of light, or motion blur on moving objects can enhance the realistic perception of the rendering.
With the development of modern computers, 3D rendering has become a major step in many applications such as video games, simulators, movies, augmented reality, and virtual reality.
In order to convey a realistic 3D experience to users, two possible approaches, or a combination of them, might be considered. First, the generated graphical view might convey depth through various noticeable cues such as perspective, shading, texture-mapping, reflection, depth of field, transparency, translucency, and refraction. Second, the generated scene might be rendered using stereoscopic techniques to convey the illusion of depth, with the final result visualized on 3D displays. Nowadays, due to the popularity of 3D displays, both techniques are combined to enhance the quality of user experience.
Property | shape-based | color-based | depth-based | 3D model-based | Method 1 (Rot. sym.) | Method 2 (Gesture search)
Efficiency of detection | 3 | 5 | 5 | 4 | 5 | 5
Accuracy of detection | 4 | 3 | 5 | 4 | 4 | 5
Tracking quality | 4 | 3 | 5 | 4 | 4 | 5
Gesture recognition | 4 | 3 | 4 | 4 | 4 | 5
Robustness to environmental conditions | 3 | 1 | 2 | 3 | 3 | 5
3D motion | 3 | 2 | 4 | 4 | 4 | 4
Large-scale gesture | 2 | 3 | 4 | 4 | 3 | 5
Cluttered background | 3 | 2 | 5 | 4 | 4 | 5
Occlusion | 2 | 1 | 1 | 3 | 1 | 3
Scale-invariance | 2 | 3 | 4 | 3 | 3 | 5
Rotation-invariance | 2 | 3 | 4 | 3 | 3 | 5
Deformation-invariance | 2 | 3 | 3 | 3 | 3 | 5
Mobile platform | 3 | 3 | 0 | 2 | 4 | 5
Multi-gesture | 4 | 3 | 5 | 4 | 3 | 5

Table 5.1: Properties of the different methods in gesture analysis are compared with the proposed solutions of this thesis. Method 1 and Method 2 represent the discussed methods based on Rotational Symmetries and the Gesture search engine, respectively. The quality of each property is scaled between zero and five. 0: not applicable. 1: very weak. 2: weak. 3: average. 4: strong. 5: very strong.

Since the interaction between users and digital devices happens at the interface level, the effect of the provided technical solutions might be visualized in a graphical interface. Basically, two scenarios are considered for graphical interface design in this thesis. First, manipulation of graphical objects using
hand gestures, and second, manipulation of the graphical scene in interactive
3D vision. For both cases, the graphics are rendered in an OpenGL environment. Perspective projection, lighting, color, reflection and other features are used to provide a realistic 3D experience. Moreover, in order to convey the illusion of depth, the rendered output is provided in color-code stereoscopic 3D. With this technology, users can experience the illusion of depth on any 2D screen using simple color-code glasses.
In most of the designed environments for 3D gestural interaction, the manipulation of graphical objects in an augmented environment is considered. Normally, the rendered objects are shown on the live camera view while the user can pick, rotate, move, zoom in/out, or even reshape the objects in real-time.
For interactive 3D vision, the main goal is to place the users in a virtual reality environment, enabling them to move around and perceive the rendered scene in an interactive manner. Thus, the recommended setup for this scenario is a rather large, possibly wall-sized, screen.
5.5 Research Scenarios
Conceptual and technical contributions of this thesis have been tested and
used for implementation in different research scenarios. The major research
scenarios can be summarized in the following items.
5.5.1 Implementation of the 3D Gestural Interaction on Mobile Platform

Since one of the main target areas for applying the proposed technologies is future mobile devices, implementation on mobile platforms is an essential part of this work. The Android platform is selected for the mobile implementation of the gesture-based interaction. The core of the system for detecting, tracking and analyzing the gestures is developed in native C/C++ in an OpenCV environment. The graphical part is mainly handled by OpenGL (Open Graphics Library [79]). In some earlier versions, Min3D (a 3D library for Android using Java and OpenGL ES) has been used for rendering different graphical objects (see Fig. 5.7).

Figure 5.7: Graphical interface in the mobile application. Implementation of the proposed systems in photo browsing and 3D manipulation.
5.5.2 Implementation of the Interactive 3D Vision on a Wall-sized Display
The proposed interactive 3D vision is tested in three different setups. The first test is performed on a normal computer display with both 2D and stereoscopic 3D rendering. In the second test, the output is displayed on a wall using a video projector. The third test is performed on the 4K wall-sized display in the KTH VIC lab (visualization studio). In all three cases, the graphical scene is rendered in an OpenGL environment. In the stereoscopic case, passive 3D glasses are used for depth perception (see Fig. 5.8).
An important point to mention here is that the interactive 3D vision on personal devices might be set up in both active and passive configurations. As discussed before, in the active configuration, where the vision sensor is mounted on the user's head, the resolution of the 3D motion analysis is significantly higher than in the passive configuration. The accuracy level is highly dependent on the relative distance between the moving subject and the vision sensor. Thus, if we remove the body-mounted camera and simply use the device's camera for motion tracking, decreasing the distance between user and device can restore the accuracy to a proper level. This is usually the case where users interact with their devices at closer range, such as when operating laptops, smartphones and tablets. In these cases, due to the simplicity and comfort for the users, the passive configuration is more practical. Although it does not provide the same level of accuracy as active vision, it is generally acceptable for natural interaction (see Fig. 5.9).
In the conducted experiments on a MacBook Pro, the device's camera is used for tracking the head motion. In order to improve the quality of tracking, face detection is applied in the first step to immediately separate the moving part from the rest of the image. Afterwards, the discussed technology for tracking and estimating the 3D head motion is used to provide the required data for 3D interaction with the content.
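A minimal sketch of this face-gated tracking setup, using OpenCV's standard Haar cascade face detector; the cascade file name and camera index are illustrative assumptions:

    #include <opencv2/opencv.hpp>
    #include <vector>

    int main() {
        cv::CascadeClassifier face;
        face.load("haarcascade_frontalface_default.xml");  // ships with OpenCV
        cv::VideoCapture cap(0);                           // built-in camera
        cv::Mat frame, gray;
        while (cap.read(frame)) {
            cv::cvtColor(frame, gray, cv::COLOR_BGR2GRAY);
            std::vector<cv::Rect> faces;
            face.detectMultiScale(gray, faces, 1.1, 3);
            if (!faces.empty()) {
                // Restrict further feature detection/tracking to the face
                // region, separating the moving head from the background.
                cv::Mat roi = gray(faces[0]);
                // ... run the 3D head motion estimation on roi ...
            }
            if (cv::waitKey(1) == 27) break;               // Esc quits
        }
        return 0;
    }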
In larger interaction spaces, such as visualization on wall-sized displays, active motion estimation is unavoidable for accurate and high-resolution interaction; the quality of motion tracking with a passive installation is quite weak when the distance between the sensor and the moving subject is large.
5.5.3 3D Rendering and Visualization of 2D Content
Unlike 3D graphics, where the geometry of the scene and objects is known, 3D visualization of 2D content such as images and videos is quite a challenging task. Single-view and multiple-view analysis for retrieving 3D information from 2D content are introduced in the contributions of papers V and VI [58, 59]. The main idea is to provide a supportive technology to enhance the user experience in interactive applications. This technology enables users to see the content in 3D while they operate the application or manipulate the content. The 3D visualization is based on stereoscopic techniques using passive glasses. Due to the simplicity of the tests and the applicability to any type of display, anaglyph glasses are used in all the conducted tests [58, 59]. The 3D channel coding is performed based on the selected glasses.

Figure 5.8: Active 3D vision tests in different setups.
5.6 Potential Applications
Contributions of this thesis in 3D motion analysis and visualization can be used
in a wide range of multimedia applications on mobile devices and stationary
systems. Virtual reality, augmented reality, medical imaging, motion based
interactive systems, 3D games, 3D displays, motion-based localization and
positioning systems, visual search and many other applications might take
advantage of the proposed methods. Here, several implemented and potential
applications based on the contributions of this thesis are briefly explained.
Figure 5.9: The passive configuration shows a similar effect to the active one in close-range interaction.
5.6.1 3D Photo Browsing
An interactive photo browser enables users to manipulate their photo collections in 3D space. Unlike 2D interaction, where only one user can operate the device (due to the limited interaction area), in 3D interaction two or more users can share both the interaction and visualization spaces for collaborative tasks. Users might sit together to share their photo collections and manipulate them in 3D space while each has their own device; they can use a single device and share the interaction and visualization spaces; or they might share a virtual space while present at different locations.
5.6.2 Virtual/Augmented Reality
In [39, 54], gestural interaction techniques are applied to render graphical objects in augmented environments. The analysis of the hand gesture motion behind the mobile phone's camera is used to manipulate graphical models. Six-DOF motion control with a high level of accuracy enables users to experience a realistic interaction with their mobile devices. The proposed gestural interaction for the manipulation of graphical models is also implemented on the Android platform. The efficiency and performance of the system are tested and validated on different devices, such as Samsung and HTC smartphones. The visual outputs are rendered in both 2D and 3D formats.
5.6.3 Interactive 3D Display
Human motion tracking might be used to interact with the display device. In the implemented interactive systems introduced in [57, 61], the user controls the content of the display using head or gesture motion. The retrieved motion parameters (rotations and translations along three axes) between the consecutive frames captured by the device's camera apply the motion control to the application.
5.6.4 Medical Applications
The proposed technologies might be widely used in medical applications. 3D motion tracking and analysis of patients can help physicians diagnose and treat physical disorders in various types of diseases. Furthermore, 3D imaging and visualization of body organs, together with interactive 3D manipulation on display devices, help experts analyze and diagnose physical problems and select the required treatment.
5.6.5 3D Games
One of the most exciting areas that can benefit from the efficient 3D motion
analysis is 3D gaming. Bare-hand, marker-less gesture analysis by using ordi-
nary 2D cameras provides a great chance for experiencing a realistic interaction
with the graphical environment in 3D games. Head and gesture detection and
tracking, using the techniques discussed in the previous chapters, provide an
effective way for playing in 3D environments.
5.6.6 3D Modeling and Reconstruction
Many digital photo/video capturing devices provide, in addition to a vision sensor, other types of embedded sensors such as GPS and orientation sensors. Therefore, extra information such as position and orientation is tagged to the captured photos. This geo-tagging has been found to be useful in many applications such as 3D digital photo albums, photo-tagged maps and visual navigation. In many cases, however, the geo-tagged meta-data are corrupted by noise or missing due to the unavailability of the GPS signal or magnetic sensors. In paper VII [60], we discuss how 3D motion analysis can help to form a signal model and significantly correct this noisy data.
5.6.7 Wearable AR Displays
The contributions of this thesis fit the area of mobile augmented reality particularly well. AR glasses such as Google Glass, which integrate information into augmented environments, require intuitive interaction technology. Since in wearable AR glasses the touchscreen is removed or reduced to a smaller scale, convenient 3D gestural interaction can definitely enhance the interaction experience.
5.7 Usability Analysis in Object Manipulation: Touchscreen Interaction vs. 3D Gestural Interaction
In order to evaluate the user experience in 3D gestural interaction, a comparative user study was conducted. In this study, the manipulation of graphical objects in 3D space using bare-hand gestures is considered. Learnability, user experience and interaction quality are evaluated and compared with the same task in 2D touchscreen interaction. Four students from the course Evaluation Methods in HCI (DH2408) assisted this study by selecting this case as their course project. In order to provide a comparative scenario for evaluating the 3D gestural interaction, two sets of designed interfaces and tasks for 2D touchscreen and 3D gesture-based interaction, the required usability tests, and questionnaires for the user interviews were provided to the students. In this task, they were supposed to invite users, test the learnability and usability of both systems, and collect and report the required information based on the given instructions. Here, the whole process is explained in detail.
Touchscreen interaction: Two smartphones are used for this case, positioned side-by-side on a table. Smartphone 1 plays a pre-recorded video of the rendered graphical model. On smartphone 2, the same graphical model is rendered, and the user can manipulate it through the touchscreen to control the position, zoom and viewpoint along the x, y, and z axes. During the task, the user should follow and mimic the exact motion of the graphical model on smartphone 1 through real-time manipulation of the model on smartphone 2 using touchscreen interaction. A webcam mounted above the smartphones records the touchscreen interaction for further study.
3D gestural interaction: In this case, the user can control and manipulate the same graphical model in 3D space using bare-hand interaction. A Kinect depth sensor is used to detect and measure the user's hand motion in 3D space. As in the previous case, the same pre-recorded motion tasks are displayed on the computer screen, and the user should follow and mimic the motion of the graphical model in free space through real-time 3D interaction. A camera records the whole task for further study.
Task: Both the 2D and 3D interaction tasks are divided into different parts. In each part, the graphical model moves with a specific motion sequence to reach a certain position/orientation; the user should then follow the same motion to reach a similar position/orientation. These pre-recorded tasks are divided into 10 parts. The first two videos are used for the learnability step, where new users learn how to work with both the 2D and 3D tasks. In the main part, 8 videos (2 easy, 4 normal, and 2 hard) are considered (see Fig. 5.10).

Figure 5.10: User test in 2D touchscreen and 3D gestural interaction.
5.7.1 User Test
For this study, ten users were selected: one pilot user, seven in the primary target group (experienced in using touchscreens) and two in the secondary target group (very little or no experience in using touchscreens). According to Nielsen [80, 81], five people find 85% of the problems; therefore, ten users (mostly between the ages of 20 and 30) were enough to provide proper results.
As mentioned before, the goals are to test the learnability of the 3D gestural interaction system and to compare the user experience of 3D gestural interaction with the touchscreen interface. For the comparative analysis, efficiency, effectiveness and user satisfaction are considered the main criteria. In order to increase the reliability of the tests, subjective data based on the experience of the participants during and after the test sessions were gathered, by filling in scale-based forms and answering predefined questions.
In addition, user performance was observed ("seeing as doing") during the comparative tests, and manual data were collected. Together, these two steps provide access to both quantitative and qualitative data for the final evaluation.
Figure 5.11: Average score of the 2D vs. 3D user performance.
5.7.2 Usability Results
Since the tests are based on following the movements in a video, the following method is used for the quantitative measurement of user performance, motivated by the MUSiC user performance method [82]. All observers in the student group watched the recordings of each and every user test and scored them from 1-7 according to how well the user performed based on the instructions in the video (1 = no coherence at all and 7 = no difference between video and performance). After all four students had scored the user tests, the average scores were calculated. Fig. 5.11 shows the measured comparative performance between the 2D and 3D scenarios. Since tasks 1 and 2 are used for the learnability step, they are not included in the chart. From the collected data, some important points can be distinguished. Firstly, it is clear that the touchscreen interface works fine as long as the task is limited to spinning/turning around the x and y axes without any translation or zooming. As soon as movement in 3D or zooming comes into play, the 3D gestural interface clearly shows its strength; the scores on tasks 3, 4, 6, 8 and 9 support this observation (task 3 was primarily limited to turning the object). This is also confirmed by the comments of the users in the interviews, where five users mentioned that on the touchscreen turning objects was easier than turning and moving combined, whereas the combination of rotations, translations and zooming was simpler with the 3D gestural interaction. It may be too early to tell without future studies with larger user groups, but according to the collected data, the 3D gestural interaction seems to be the overall preferred system, since all users (with one exception) scored higher on it than on the touchscreen interface.
During the test sessions, two users had little to no experience in using touchscreens (two male teachers in their sixties); these users formed the secondary group in the usability evaluations. Although neither of them owned a smartphone or regularly used touchscreens, one of them had a little experience with a Nintendo Wii, which might be the reason he scored substantially higher in the 3D interaction part, whereas the second user scored only a few points higher than on the touchscreen interface. It would seem that 3D gestural interaction is the preferred system for people who learn both systems at the same time: both of them scored higher on the 3D gestural interaction, and during the interviews both said that they strongly prefer the 3D interface over the touchscreen. However, they both mentioned that large hands and fingers might cause problems on touchscreens.
The findings from the interviews reveal that users truly believe that 3D gestural interaction will be a standard interaction tool for future applications, although there are some differences of opinion about how widely it will be used in future interactive scenarios. During the interviews, the users also had to respond to a few statements and indicate how much they agreed with each on a scale of 1-5 (1 = I do not agree at all, and 5 = I fully agree); the results are reflected in Fig. 5.12. In the learnability step, the main idea was to let users watch the videos and intuitively start following the recorded motions instead of giving specific instructions. Based on the gathered responses, it is clear that users find the 3D system easy to learn and think that most people will learn to use a similar interface quite easily. They mainly believe that with a bit more time in front of the 3D interface they would master it.

Figure 5.12: Qualitative results of 3D gestural interaction.
3D gestural interaction has a quicker learning curve than the touchscreen interface and is clearly better for performing more complex movements. However, interfaces using 3D gestural interaction must be specifically designed for gesture-based inputs, in the same way as applications for touchscreens are developed differently from those for a desktop computer where mouse and keyboard are available. This is already done in games designed for the Kinect and similar products.
Chapter 6

Concluding Remarks and Future Direction
6.1 Contributions
Today’s multimedia technology is highly inspired by two strong trends: tech-
nologies towards intuitive interaction and technologies towards augmented vi-
sualization. The former trend provides natural interaction technology for ef-
fective communication between users and smart devices. The latter trend,
which is considered as the direction to the fifth screen or augmented reality
visualization, combines the interactive experience on the personalized screen
with augmented information through the Internet. Technical contributions of
this thesis can support the development of both trends. In fact, 3D interaction
through intuitive gestures is unavoidable part of the future AR applications.
Therefore, defining new frameworks for effective interaction in augmented en-
vironments will improve the quality of user experience in future mobile ap-
plications. Basically, contributions of this thesis might be divided into three
main categories: Conceptual models for future human mobile device interac-
tion; Technical contributions towards 3D interaction design and interactive
visualization; Implementation of the proposed concepts and methods for dif-
ferent application scenarios.
6.1.1 Conceptual Models for Future Human Mobile Device Interaction

This thesis proposes new concepts and frameworks for future human mobile device interaction. The main features of the proposed ideas can be summarized as follows.

• Current and future trends in multimedia technology, especially in mobile multimedia, together with future demands, challenges, limitations and directions, are discussed in detail.
• The evolution of interaction and visualization facilities on mobile devices and future trends are investigated.
• The concept of extending the interaction and visualization spaces to 3D on mobile devices is introduced and its advantages are discussed.
• The concept of 3D gestural interaction on mobile devices is introduced and its significant impacts are discussed.
• The concept of collaborative tasks on mobile devices using bare-hand interaction in 3D, with shared interaction and visualization spaces, is discussed.
• Potential application scenarios based on 3D gestural interaction are introduced.
• The concepts of user-manipulated content and interactive 3D vision are introduced and discussed.
6.1.2 Technical Contributions for 3D Gestural Interaction and 3D Interactive Visualization

Technical contributions are the main focus of this thesis. New methods and frameworks for 3D gesture analysis and interactive visualization have been introduced. Specifically, the technical contributions are mainly focused on two major problems: first, interaction with mobile devices based on motion analysis in 3D space [38, 39, 54, 57], and second, 3D visualization on ordinary 2D displays [58, 59, 60, 61]. The introduced interactive systems are based on the detection, tracking, and analysis of the users' 3D motion from visual input. This visual input might come from the mobile device's camera, a body-mounted camera, a webcam, and in general from any type of vision sensor. The technical contributions can be listed as follows.

• The concept of 3D gesture recognition and tracking, with new methods and algorithms based on low-level operators, is discussed.
• Novel methods and algorithms for gesture recognition and tracking based on a large-scale search framework are introduced and investigated.
• The proposed methods for 3D gesture analysis are compared and evaluated.
• Technical solutions for the motion-based interactive 3D display are introduced and compared in different configurations.
• Different configurations for the interaction between users and multimedia content in various scenarios and platforms are discussed.
• New methods for the 3D visualization of monocular images, photo collections, and videos are investigated and discussed.
6.1.3 Implementations
The implemented scenarios based on the conceptual and technical contributions can be summarized in the following items.

• New methods for gesture analysis based on low-level patterns are implemented.
• A new framework for 3D gesture analysis based on large-scale retrieval and search methods is implemented.
• 3D gesture detection and tracking are implemented on different platforms (Windows, Mac OS X, Android).
• Interactive 3D vision is implemented and tested in different scenarios, from personal smart devices to wall-sized displays.
• 3D visualization of monocular images, photo collections and videos is implemented and tested.
6.2 Concluding Remarks and Future Direction
Although today’s media industry is highly inspired by 3D technology, but
realistic interaction and visualization are still at their early stage of develop-
ment. Realistic visualization has attracted a lot of attention during the recent
decade. Introduction of 3D display technology in TVs, projectors and even
on mobile devices is the indication of the fast-growing 3D market. Strong
efforts towards changing the stereoscopic 3D to glasses-free 3D displays are
other indications of the general trend for intuitive and realistic visualization.
However, the current technology of 3D displays is quite different from real
human observation of the 3D world and significant improvements are required
to fulfill the objective of realistic visualization. Contributions of this thesis
in 3D visualization, especially the introduced concept and technology for in-
teractive 3D display, support the realistic and intuitive visualization. In fact,
the main idea behind interactive visualization is to enable users observe the
content, control the angle and viewpoint, in a similar manner to the real world
observation.
The introduction of 3D interaction facilities such as the Microsoft Kinect has significantly changed the way people interact with digital content, especially in the entertainment area. Since real 3D interaction requires extremely high accuracy in the 3D motion estimation and tracking of the body joints, there are still many unsolved issues and challenges. However, strong indications reveal that future human mobile interaction will be highly affected by intuitive 3D interaction. The contributions of this thesis aimed to tackle the fundamental issues and propose novel ideas towards solving them.
In this thesis, 3D gestural interaction is deeply investigated as an effective
tool for future human mobile device interaction. Computer vision, pattern
recognition, and machine learning methods are widely used in this area. Ob-
servations and experimental results of this thesis indicate that although these methods might be extremely useful for solving different challenges in 3D gesture recognition, 3D motion analysis, etc., they are not adequate for the generalized problem formulation. Therefore, new methods for 3D gesture analysis through the large-scale retrieval system have been introduced. Due to the possibility of storing and processing extremely large databases and the corresponding metadata, future methodologies for solving the discussed problems will be mostly centered around metadata retrieval and search methods instead of processing the low-level data. Thus, the preparation of rich and comprehensive databases can formulate the classical problems in a totally new way. For instance, the challenges of gesture recognition and tracking can be shifted from the signal processing level to large-scale search and matching frameworks. Although image-based retrieval and template matching are well-known concepts in media technology, a large-scale search framework for gesture analysis is rather a new concept and needs further development.

Figure 6.1: Different approaches for solving the technical challenges in media technology. The current trend shows the gradual move from low-level features and high-level algorithms towards meta-data retrieval from large databases.
This thesis has introduced and investigated this framework for high accu-
racy 3D motion retrieval and gesture tracking. Experimental results indicate
that the search framework is extremely powerful especially when recognition,
tracking and 3D motion retrieval are required all together, at large scale and in real-time. On the other hand, if we target specific patterns and models for recognition and tracking, computer vision methods can handle the complexity of the problems.
6.2.1 Technical Challenges
During this research, various methodologies, algorithms and approaches towards solving the current and future challenges in media technology have been considered. Some of the important technical challenges and findings tackled during this research work are highlighted in the following items.
6.2.1.1 Active vs. Passive Motion Capture
A common discussion in human motion analysis concerns where the motion capture sensor should be mounted. In order to enhance both the accuracy of the tracking and the convenience of the users, various configurations have been introduced for different application scenarios. As discussed before, for more intuitive and natural interaction design, marker-less bare-hand solutions are preferred to wearable sensors such as motion capture gloves or body-mounted devices.
Although current motion analysis setups can be divided into passive and active systems, on mobile devices or augmented reality glasses motion analysis might be performed in both passive and active configurations; the mobile sensor can be used in static or dynamic modes. This provides a great chance to exploit the technical advantages of both configurations. For instance, the camera of AR glasses offers the advantages of active motion analysis for a moving head, while it can be used as a passive sensor for hand gesture tracking. This thesis has demonstrated practical scenarios where each configuration shows its advantages. For instance, the proposed interactive 3D visualization employs active vision for manipulating the content from larger distances (wall-sized displays and projections), while the same system is introduced in a passive configuration for close-range interaction with mobile devices or laptops. As discussed before, in order to design a realistic experience for hand gesture interaction, the passive configuration is preferred: intuitive interaction should happen using bare hands in free space. The proposed technical solutions, however, provide flexible designs for different application scenarios.
6.2.1.2 Gesture Detection and Tracking without Intelligence
The majority of the available computer vision and pattern recognition methods employ complex algorithms for gesture detection, recognition, and tracking. These types of solutions usually involve heavy computation and large training sets. Obviously, for mobile systems with hardware and power limitations, the majority of the common solutions are not applicable. The idea of employing low-level operators for detecting and tracking hand gestures is to ensure efficient detection without intelligence. Although implementing an effective gesture analysis system without high-level detection algorithms is quite challenging, for efficiency reasons this important goal should be achieved. Employing rotational symmetries for detecting and tracking bare-hand gestures is based on this idea.
6.2.1.3 Adaptability of the Contributions to Future Hardware Evolution
Obviously, with the current rate of technology development, new types of sensors will be introduced and embedded into smart devices. Although the proposed solutions of this thesis are mainly designed and tested based on the current technology, they can in fact fit future environments perfectly. The development of new sensors and extra hardware-related features can additionally support the contributions and enhance the quality of the achieved results. For instance, the release of the Kinect sensor provided more flexibility for the proposed concepts, designs and technologies due to its capability to provide additional depth information. Clearly, the integration of RGB images and depth information can substantially improve the quality of detection, tracking, noise removal, etc.
Another example is the development of wearable AR glasses. Although presenting information through AR glasses is not a new concept in media technology, the technical development of recent years has made its implementation possible. The combination of a lightweight wearable display and different types of sensors is the ideal scenario for gesture-based interaction technology, and the technical contributions of this thesis fit this area perfectly.
6.2.1.4 Contributions of other Research Areas to Computer Vision
The rapid development of other research areas can contribute strongly to the computer vision field; solving gesture analysis problems through search methods is based on this idea. Since search algorithms are extensively used for text and document retrieval, modeling the gesture recognition and tracking problems with common search methods such as indexing can effectively improve the research results. These findings reveal that breakthrough technologies from other research areas can be successfully adapted to similar concepts with totally different application scenarios. Retrieving the best gesture entry from a huge database of images is conceptually similar to finding the document most relevant to a searched text phrase. Thus, integrating classical computer vision and pattern recognition methods with enabling technologies from other research fields can provide extremely powerful tools for solving the technical challenges.
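As a toy illustration of this analogy (a minimal sketch, not the thesis implementation), gesture images can be reduced to quantized "visual words" and served from an inverted index, exactly as a text search engine maps terms to documents; the word ids below are made up for the example.

```python
# Minimal sketch of the text-retrieval analogy: each gesture image is a
# "document" of visual-word ids, and the inverted index maps each word
# to the database images containing it.
from collections import defaultdict, Counter

def build_inverted_index(database):
    """database: {image_id: iterable of visual-word ids}."""
    index = defaultdict(set)
    for image_id, words in database.items():
        for w in words:
            index[w].add(image_id)
    return index

def query(index, query_words, top_k=3):
    """Score database images by how many query words they share."""
    votes = Counter()
    for w in set(query_words):
        for image_id in index.get(w, ()):
            votes[image_id] += 1
    return votes.most_common(top_k)

# Toy usage with made-up word ids:
db = {"open_hand_01": [3, 17, 42, 42, 8],
      "fist_07":      [5, 17, 99],
      "pinch_12":     [3, 42, 8, 64]}
idx = build_inverted_index(db)
print(query(idx, [3, 42, 8]))  # both matching entries share all 3 words
```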
6.2.2 Further Development
A large number of application scenarios and configurations might benefit from the technologies proposed in this thesis. Since this research was conducted within a limited period of time, it was not possible to investigate all aspects of the proposed methods in depth, such as user studies and design features. The evaluations and experiments mainly concern the technical aspects of the contributions; however, for the systems implemented on the basis of the proposed methods, user experience and design aspects have been considered and studied in most cases. Some interesting directions for further research and development are outlined below.
6.2.2.1 Concept of Collaborative 3D Interaction
Developing 3D interaction for collaborative scenarios is an interesting line of further research. From a technical perspective, collaborative 3D interaction can be implemented on the basis of the solutions proposed in this thesis. As discussed in the previous chapters, sharing the interaction/visualization space among several users opens up numerous application scenarios. Exchanging digital information such as documents, photos, and audio and video tracks between users is one example of collaborative sharing based on 3D gestural interaction: users might grab, move, and pass multimedia content in a shared 3D space using physical hand gestures.
6.2.2.2 Concept of Interaction in the Space using Body Gestures
Interaction between humans and future smart devices can be extended to the whole space. In particular, with the introduction of AR glasses, hand-held devices will no longer be needed, and the entire space in front of the user can be dedicated to interaction. The contributions of this thesis have focused on hand gesture technology for interaction in front of the smart device and on 3D head motion estimation for interactive displays. Since the interaction space can be extended further, whole-body motion for action recognition, as well as other body parts such as the feet, might be employed to design interactive application scenarios.
6.2.2.3 Extension of the Gesture Search Framework to Extremely Large Scale
The proposed search framework for gesture recognition and tracking has been implemented and tested with different databases, the largest containing 10,000 gesture entries. Although this number of gesture poses seems large enough for handling the gesture analysis problem, the estimates indicate that for extremely high-resolution tracking the database should be extended further. One important line of further development is therefore to generalize the retrieval system to an extremely large database; real-time matching over such an extended database will certainly be a challenging problem. A sub-linear search structure, sketched below, is one common way to keep the matching cost manageable.
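The sketch below uses a k-d tree over assumed 64-dimensional descriptors purely for illustration; the framework's actual descriptors and index are not reproduced here.

```python
# Minimal sketch of sub-linear matching: descriptors are stored in a
# k-d tree built once offline, so a query needs far fewer comparisons
# than a linear scan over the whole gesture database.
import numpy as np
from scipy.spatial import cKDTree

rng = np.random.default_rng(0)
database = rng.random((10_000, 64))   # 10,000 entries, 64-D toy descriptors
tree = cKDTree(database)              # built once, offline

query_descriptor = rng.random(64)
distance, entry_id = tree.query(query_descriptor, k=1)
print(f"best match: entry {entry_id}, distance {distance:.3f}")
```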
6.2.3 Future of Mobile Interaction and Visualization
It is quite difficult to predict the evolution of smart devices over the next ten years. From a hardware point of view, the current trend suggests that displays might use 3D technology in transparent, flexible, or wearable formats. Future devices will certainly be equipped with numerous sensors, larger storage, and faster processors. High-speed mobile network connections might totally change the design of future smartphones: storage and processing can be shifted to the infrastructure, with smart devices acting as a set of sensors and a screen for visualization.
A huge network of connected devices will make it possible to share digital content in a virtual space, and collaborative interaction might become an important part of future mobile multimedia.
The concept of personalized environments and screens can totally change the future of visualization. With the introduction of AR glasses, any space can be turned into a personalized environment, with the whole space and the augmented information designed according to the user's demands.
The future of interaction technology on mobile devices might be strongly shaped by multimodal inputs, of which 3D technology for intuitive interaction will certainly be an essential part. Bare-hand interaction in 3D space can serve various tasks, while other input modalities such as voice, motion, or orientation will be complementary.
An important point to mention here is that natural interaction requires the sense of touch, which is probably the essential feature that free-space interaction still lacks. Ultrasound 3D rendering is an enabling technology that might be used to implement the sense of touch in free space, offering the rendering of virtual objects that can be felt in mid-air. Given the early stage of this research [83], however, the availability of such technology for mobile devices in the near future is quite questionable at this moment.
Chapter 7
Summary of the Selected Articles
This thesis reflects the results of the research conducted by Shahrouz Yousefi during his PhD studies. The first, introductory part of the thesis is based on the results of more than 15 papers published in international conferences and journals; the publication list is included at the end of this chapter. The second part of the thesis includes seven papers. In all selected papers, Shahrouz Yousefi (the author of this thesis) is the first/corresponding author, and the major contributions of these seven papers, including concepts, theories, experiments, implementations, and writing, are his. Prof. Haibo Li has supervised Shahrouz Yousefi as the main supervisor during this PhD study. The third author has assisted Shahrouz Yousefi in some experiments or has participated in the discussions.
Chapter 8 introduces gesture detection, tracking, and 3D motion analysis based on first-order and second-order rotational symmetry patterns. Rotational symmetries have been used for gesture localization and fingertip detection; feature detection, feature tracking, and 3D motion retrieval have been performed, and the computed motion parameters have been used to control and manipulate the virtual objects on the screen. Various application scenarios that might benefit from the proposed technology have been introduced. The content of this chapter has been published as a journal article in Pattern Recognition Letters (PRL).
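As a rough illustration of the underlying idea (assumed filter parameters; not the chapter's actual implementation), an n-th order rotational symmetry can be detected by correlating the double-angle orientation image with a complex filter exp(i·n·φ); first-order responses peak at parabolic patterns such as fingertips, and second-order responses at circular patterns.

```python
# Illustrative sketch of rotational-symmetry detection. Squaring the
# complex gradient gives the double-angle orientation image, in which
# an n-th order symmetry appears as a local match to exp(i*n*phi).
import numpy as np
from scipy.signal import fftconvolve

def symmetry_response(image, n, radius=9):
    gy, gx = np.gradient(image.astype(float))
    z = (gx + 1j * gy) ** 2                    # double-angle representation

    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    phi = np.arctan2(y, x)
    disc = ((x**2 + y**2) <= radius**2).astype(float)
    filt = disc * np.exp(1j * n * phi)         # n-th order symmetry filter

    # Magnitude of the correlation = strength of the n-th order symmetry.
    return np.abs(fftconvolve(z, np.conj(filt), mode="same"))

# n = 2 responds to circular patterns, n = 1 to parabolic patterns such
# as fingertips. A bright disc gives a strong second-order peak:
img = np.zeros((64, 64))
yy, xx = np.mgrid[:64, :64]
img[(yy - 32)**2 + (xx - 32)**2 <= 64] = 1.0
resp = symmetry_response(img, n=2)
print(np.unravel_index(resp.argmax(), resp.shape))  # near (32, 32)
```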
The content of chapter 9 is reprinted from the paper published at the ACM International Conference on Multimedia (ACMMM12). This work was presented at the conference in both oral and poster sessions, was selected as one of the top eight papers for the Doctoral Symposium track, and had its content evaluated by an opponent and committee members at the conference.
This paper reflects the substantial part of this PhD thesis in brief. It introduces the concept of 3D gestural interaction, potential applications, the enabling media technologies that support this concept, and the implemented photo browsing system.
Chapter 10 introduces the concept of gesture analysis based on large-scale gesture retrieval and a search engine. The introduced technology is based on a database of annotated gesture images with corresponding 3D pose information, and a search engine for similarity analysis between the query gesture and the database entries. The output provides the best match from the database, and the annotated motion parameters are used in real-time interaction.
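A minimal sketch of that pipeline, with made-up descriptors and annotations (the actual engine is covered by the patent application mentioned below): the best-matching database entry is found by similarity, and its stored pose annotation is returned directly.

```python
# Hypothetical sketch of retrieval-based pose estimation: each database
# entry carries a descriptor plus its annotated 3D pose, so recognizing
# the query reduces to a similarity search followed by a lookup.
import numpy as np

database_descriptors = np.random.rand(1000, 32)   # toy descriptors
database_poses = np.random.rand(1000, 6)          # [x, y, z, yaw, pitch, roll]

def estimate_pose(query_descriptor):
    # Cosine similarity against every entry (a real system would index this).
    d = database_descriptors
    sims = d @ query_descriptor / (
        np.linalg.norm(d, axis=1) * np.linalg.norm(query_descriptor))
    best = int(np.argmax(sims))
    return database_poses[best]    # annotated pose drives the interaction

pose = estimate_pose(np.random.rand(32))
print("estimated pose parameters:", pose)
```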
This paper has been accepted for publication at the 9th International Conference on Computer Vision Theory and Applications (VISAPP 2014), and the work has successfully passed the novelty analysis step at KTH Innovation. Due to patent application restrictions, the full version of this work with all technical details has not been submitted for publication in conferences or journals; the extended version is instead filed as a U.S. patent application.
Chapter 11 introduces interactive 3D visualization, the proposed technology for real-time interaction between users and the content of the display. This technology enables users to control and manipulate the content of the screen based on their position/orientation in 3D space. A head-mounted vision sensor is employed to measure and report the 3D motion parameters, which are sent in real time to the rendering block for visualization on the screen. This paper is submitted to the International Conference on Image Processing (ICIP2014).
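On the rendering side, the reported head pose is typically packed into a standard view matrix so that the on-screen content counter-moves with the head. The sketch below assumes a rotation matrix and translation vector as input and is only an illustration of this step, not the paper's implementation.

```python
# Hypothetical mapping from reported head motion to a 4x4 view matrix.
# The renderer applies the inverse of the head pose so the scene appears
# anchored in space while the head moves.
import numpy as np

def view_matrix(rotation, translation):
    """rotation: 3x3 head rotation; translation: 3-vector head position."""
    view = np.eye(4)
    view[:3, :3] = rotation.T                  # inverse rotation
    view[:3, 3] = -rotation.T @ translation    # inverse translation
    return view

# Toy update: head moved 5 cm to the right, no rotation.
V = view_matrix(np.eye(3), np.array([0.05, 0.0, 0.0]))
print(V)   # scene shifts 5 cm the other way in camera coordinates
```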
The content of chapter 12 is reprinted from the paper published at the International Conference on Signal Processing and Multimedia Applications (SIGMAP 2011). This paper introduces a technology for 3D visualization of monocular images based on patch-level depth retrieval; stereoscopic techniques are used for 3D visualization on a normal 2D display.
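As a toy illustration of the stereoscopic step (the patch-level depth retrieval itself is the paper's contribution and is not reproduced here), a second view can be synthesized by shifting pixels in proportion to a disparity derived from depth and packing the two views into a red-cyan anaglyph; the disparity scale below is an assumption.

```python
# Toy depth-image-based rendering: synthesize a right view by shifting
# each pixel horizontally by a disparity proportional to its depth,
# then pack left/right into a red-cyan anaglyph.
import numpy as np

def synthesize_anaglyph(gray_image, depth, max_disparity=8):
    h, w = gray_image.shape
    disparity = (depth / depth.max() * max_disparity).astype(int)
    right = np.zeros_like(gray_image)
    cols = np.arange(w)
    for row in range(h):
        shifted = np.clip(cols - disparity[row], 0, w - 1)
        right[row] = gray_image[row, shifted]
    # Red channel from the left eye, green/blue from the right eye.
    return np.dstack([gray_image, right, right])

img = np.random.rand(4, 6)
depth = np.tile(np.linspace(1.0, 2.0, 6), (4, 1))
print(synthesize_anaglyph(img, depth).shape)   # (4, 6, 3)
```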
The content of chapter 13 is the reprinted version of the paper published at the IEEE International Conference on Wireless Communications and Signal Processing (WCSP 2011). This paper discusses a technology for converting 2D monocular photo and video collections to 3D and visualizing them on 2D displays using stereoscopic technology.
Chapter 14 introduces a vision-based technique for robust correction of 3D geo-metadata in photo collections. The proposed technology efficiently improves the accuracy of the position/orientation information in photo collections and consequently enhances the 3D visualization, navigation, and exploration of large data sets. The content of chapter 14 is reprinted from the paper published at the IEEE International Conference on Wireless Communications and Signal Processing (WCSP 2011).
7.1 List of Publications
The content of this thesis is based on the contributions of the following articles, although not all of them are included:
Journal articles:
• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Experiencing Real 3D Gestural Interaction with Mobile Devices, published in Pattern Recognition Letters (PRLetters), 2013.
• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Gesture Tracking for Real 3D Interaction Behind Mobile Devices, published in the International Journal of Pattern Recognition and Artificial Intelligence (IJPRAI), 2013.
• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li, Direct Head Pose Estimation Using Kinect-type Sensors, published in Electronics Letters, 2014.
Licentiate thesis:
• Shahrouz Yousefi, Enabling Media Technologies for Mobile Photo Browsing, Licentiate Thesis, Digital Media Lab (DML), Department of Applied Physics and Electronics, Umeå University, SE-901 87, Umeå, Sweden, ISSN: 1652-6295:16, ISBN: 978-91-7459-426-3, 2012.
Conference papers:
• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Bare-hand Gesture Recognition and Tracking through the Large-scale Image Retrieval, accepted for publication in the 9th International Conference on Computer Vision Theory and Applications (VISAPP), January 2014.
• Shahrouz Yousefi, 3D Photo Browsing for Future Mobile Devices, In Proceedings of the 20th ACM International Conference on Multimedia (ACMMM12), October 29-November 2, Nara, Japan, 2012.
• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li, Real 3D Interaction Behind Mobile Phones for Augmented Environments, In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), Barcelona, Spain, July 2011.
• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, 3D Gestural Interaction for Stereoscopic Visualization on Mobile Devices, In Proceedings of the 14th International Conference on Computer Analysis of Images and Patterns (CAIP), Seville, Spain, CAIP (2), Vol. 6855, Springer, pp. 555-562, 29-31 August 2011.
• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, 3D Visualization of Single Images using Patch Level Depth, In Proceedings of the International Conference on Signal Processing and Multimedia Applications (SIGMAP), Seville, Spain, 18-21 July 2011.
• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Stereoscopic Visualization of Monocular Images in Photo Collections, In Proceedings of the IEEE International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, pp. 1-5, 9-11 Nov. 2011.
• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Robust Correction of 3D Geo-Metadata in Photo Collections by Forming a Photo Grid, In Proceedings of the IEEE International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, pp. 1-5, 9-11 Nov. 2011.
• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Tracking Fingers in 3D Space for Mobile Interaction, In Proceedings of the 20th International Conference on Pattern Recognition (ICPR), Istanbul, Turkey, 2010.
Under-review articles:
• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, 3D Hand Gesture Recognition and Tracking through the Large-scale Gesture Search Engine, Submitted to the IEEE Transactions on Pattern Analysis and Machine Intelligence (TPAMI), 2014.
• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, Interactive 3D Visualization on a 4K Wall-Sized Display, Submitted to the IEEE International Conference on Image Processing (ICIP 2014), Paris, France, 2014.
• Farid Abedan Kondori, Shahrouz Yousefi, Li Liu, Haibo Li, Direct Hand Pose Estimation for Immersive Gestural Interaction, Submitted to Pattern Recognition Letters (PRLetters), 2014.
• Farid Abedan Kondori, Shahrouz Yousefi, Ahmad Ostovar, Li Liu, Haibo Li, A Direct Method for 3D Hand Pose Recovery, Submitted to the 22nd International Conference on Pattern Recognition (ICPR 2014), Stockholm, Sweden, 2014.
Other related publications:
• Farid Abedan Kondori, Shahrouz Yousefi, Li Liu, Haibo Li, Head Operated Electric Wheelchair, accepted for publication in the IEEE Southwest Symposium on Image Analysis and Interpretation (SSIAI 2014), San Diego, USA, 2014.
• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li, Samuel Sonning, Sabina Sonning, 3D Head Pose Estimation Using the Kinect, In Proceedings of the 2011 IEEE International Conference on Wireless Communications and Signal Processing (WCSP), Nanjing, China, Nov. 2011.
• Farid Abedan Kondori, Shahrouz Yousefi, Smart Baggage In Aviation, In Proceedings of the 2011 IEEE International Conference on Internet of Things (iThings-11), Dalian, China, 2011.
• Shahrouz Yousefi, Farid Abedan Kondori, Haibo Li, 3D Visualization of Monocular Images in Photo Collections, In Proceedings of the Swedish Symposium on Image Analysis (SSBA), Linköping, Sweden, 2011.
• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li, Gesture Tracking for 3D Interaction in Augmented Environments, In Proceedings of the Swedish Symposium on Image Analysis (SSBA), Linköping, Sweden, 2011.
• Farid Abedan Kondori, Shahrouz Yousefi, Haibo Li, , In Proceedings of the Swedish Symposium on Image Analysis (SSBA), Gothenburg, Sweden, 2013.
Patents:
• Shahrouz Yousefi, Haibo Li, Farid Abedan Kondori, Real-time 3D Gesture Recognition and Tracking System for Mobile Devices, U.S. Patent Application, filed January 2014. Patent Pending.