Copyright
by
William John Blanke
2001
The Dissertation Committee for William John Blanke certifies that this is the approved version of the following dissertation:
Multiresolution Techniques on a Parallel Multidisplay
Multiresolution Image Compositing System
Committee:
Chandrajit Bajaj, Supervisor
Don Fussell
Vijay Garg
Margarida Jacome
Roy Jenevein
Multiresolution Techniques on a Parallel Multidisplay
Multiresolution Image Compositing System
by
William John Blanke, B.S.E., M.S.
DISSERTATION
Presented to the Faculty of the Graduate School of
The University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
DOCTOR OF PHILOSOPHY
THE UNIVERSITY OF TEXAS AT AUSTIN
December 2001
Dedicated to Vero.
Acknowledgments
Even though this dissertation lists my name as the author, many people
were instrumental in bringing it to completion. I would like to mention a few of
these names here in appreciation. However, to avoid the risk of leaving anyone
out, before doing so I would first like to thank the faculty, students, and staff
of The University of Texas in general. A dissertation involves a lot of advice,
help, mentoring, and perhaps most of all paperwork. Without the collective
assistance of The University as a whole, there would be little chance of my
research and the documentation appearing here finding its place in print.
Dr. Don Fussell and Dr. Chandrajit Bajaj started my interest in image
compositing systems. The original ideas for developing the Metabuffer can be
attributed to them. Dr. Fussell especially took an active role in fleshing out
the preliminary plans for implementing the Metabuffer. Later, Dr. Bajaj
provided an enormous amount of time and energy suggesting how to simulate
the Metabuffer and adapt it to the cluster. He also offered a great environment
to do the work. I feel privileged to have been able to use the top quality
facilities offered by the visualization lab.
I would also like to thank the other members of my committee: Dr.
Vijay Garg, Dr. Margarida Jacome, and Dr. Roy Jenevein. With the advent
of the DVI (Digital Visual Interface) standard, image composition has become
a hot research area. I would like to thank the members of my committee for
bearing with me while my research topic bent and swayed with the rapid twists
and turns of developments in this area.
My software engineering courses taught me to concentrate on how to
use available components as technologies to match with the architecture of
my designs. With the Metabuffer project, this was especially true. Wherever
possible, I employed libraries to implement portions of the system. Because
of this, I have a number of people to thank for offering their code to the
public domain free of charge. First, Sam Leffler at SGI, for his TIFF image
compression library. I am not sure how many TIFF images I generated in
running the Metabuffer simulator and emulator, but I am sure it must be
over one million. I would also like to thank the team that wrote the MPICH
implementation of MPI, and the pthreads for Windows team, whose library forms
the threading and synchronization base of the Metabuffer simulator. I would
also like to thank Mark Kilgard, whose GLUT library allowed the Metabuffer
project to move swiftly and easily from Windows, to IRIX, and finally to
Linux without incurring any user interface headaches. Finally, the OCview
library, currently maintained by Xiaoyu Zhang, a fellow CS graduate student,
performed the rendering for the Metabuffer emulator. I am indebted to him
for his personal assistance in adapting his code for my project as well as in
generating the many isosurface data sets seen throughout this dissertation.
In addition to Xiaoyu, several other CS graduate students greatly as-
sisted me in my research. James Yang was instrumental in setting up the
Prism cluster for hosting the Metabuffer. This was no small task given the
atypical custom requirements of adding high performance graphics cards to a
computing cluster. I would also like to thank Christian Sigg for his work au-
tomating much of the cluster’s processes. Even after both of these people had
departed UT, the cluster continued to function without any major issues–a
testament to the quality of their work.
None of this research could ever hope to have been completed without
some major help from the staff in Computer Sciences. Reuben Reyes especially
fielded all kinds of requests and offered any assistance I needed. I would like
to thank Patricia Baxter in TICAM and Melanie Gulick in EE for fixing my
many paperwork mistakes and dealing with my perpetual procrastinating in
all things involving form deadlines.
I consider many of my past professors at previous universities to be
some of my greatest role models. The impact these people had in my studies
influenced me to want to continue with my graduate education. I would like to
thank Dr. Stephen Jones at The University of Virginia and Dr. John Board
at Duke University. Both professors advised me during my stays at those
institutions and I can only hope to be the kind of educator that they have
become; they inspire others to want to learn.
I can never say enough thanks to The University of Texas and the
Cockrell Foundation for offering me the chance to pursue my graduate degree.
With the funding of the MCD scholarship and the Cockrell fellowship, it was
possible for me to fully commit to learning and research instead of worrying
about dollars and cents. Grants contributed by the National Science Founda-
tion also provided additional support. Their role in graduate education cannot
be overstated.
Multiresolution Techniques on a Parallel Multidisplay
Multiresolution Image Compositing System
Publication No.
William John Blanke, Ph.D.
The University of Texas at Austin, 2001
Supervisor: Chandrajit Bajaj
In most computer graphics applications, resolution is a tradeoff. Using low-
resolution images provides a low quality display, but typically allows higher
frame rates because less data needs to be computed. High-resolution images,
on the other hand, give the best display, yet are hindered by slower refresh
times and thus limit user interactivity. Low image quality and low user inter-
activity are both detriments to computer graphics visualization applications.
The question, then, is what can be done to minimize this impact.
The aim of this dissertation is to explore how to use multiresolution
in order to provide the best balance between image quality and user inter-
activity on a parallel multidisplay multiresolution image compositing system
with antialiasing called the Metabuffer. The architecture of the Metabuffer,
a simulator written in C++, and a Beowulf cluster based emulator are fully
described in this dissertation. Additional supporting hardware and software
detailed in this document include an algorithm to partition data sets into
Metabuffer viewports and a wireless visualization control device.
Using the Beowulf cluster based Metabuffer emulator, two multires-
olution techniques are studied: progressive image composition and foveated
vision. Progressive image composition allows the user to rapidly change view-
points without immediately moving data between PCs. Instead, the resolution
of each PC’s viewport adjusts in order to cover the visible polygons for which it
is responsible. The larger, low-resolution viewports have lower image quality,
but the user sees no drop in frame rate. Over time, the PCs can readjust their
data in order to shrink their viewports and provide high-resolution imagery.
Foveated vision allows computing resources to be concentrated only where the
user is actually focused. Human peripheral vision cannot discern high lev-
els of detail. Rendering the periphery with a low polygon count using a few
low-resolution viewports allows the majority of the machines to render high-
resolution viewports only where the user (or users) are looking, thus increasing
the frame rate.
Table of Contents
Acknowledgments v
Abstract ix
List of Tables xvii
List of Figures xviii
Chapter 1. Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Chapter 2. Background and Related Work 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Sort First . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Recent Multidisplay Systems . . . . . . . . . . . . . . . 12
2.3 Sort Middle . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Sort Last . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.1 Recent Single Display Systems . . . . . . . . . . . . . . 18
2.4.2 Recent Multidisplay Systems . . . . . . . . . . . . . . . 20
2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Chapter 3. Metabuffer Architecture 25
3.1 Metabuffer Architecture . . . . . . . . . . . . . . . . . . . . . 25
3.2 Bus Dataflow . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Analysis of Bus Data Flow . . . . . . . . . . . . . . . . 29
3.2.2 Buffering of Bus Data Flow . . . . . . . . . . . . . . . . 34
3.3 IRSA Round Robin Bus Scheduling . . . . . . . . . . . . . . . 35
3.4 Sequence of Metabuffer Operations . . . . . . . . . . . . . . . 36
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Chapter 4. Metabuffer Simulator 40
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 Multiresolution Output . . . . . . . . . . . . . . . . . . . . . . 42
4.4 Antialiasing Output . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 Transparency Output . . . . . . . . . . . . . . . . . . . . . . . 46
4.5.1 Interpolated Transparency . . . . . . . . . . . . . . . . . 47
4.5.2 Multipass Methods . . . . . . . . . . . . . . . . . . . . . 48
4.5.3 Screen Door . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5.4 Metabuffer Implementation . . . . . . . . . . . . . . . . 49
4.6 Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Chapter 5. Metabuffer Emulator 54
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2.1 Granularity . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2.2 MPI Mapping . . . . . . . . . . . . . . . . . . . . . . . 57
5.2.3 Plugin API . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.3.1 Plugins . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.3 Undocumented Features . . . . . . . . . . . . . . . . . . 62
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Chapter 6. Greedy Viewport Allocation Algorithm 64
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.2.1 Sort First Algorithms . . . . . . . . . . . . . . . . . . . 65
6.2.2 Sort Last Techniques . . . . . . . . . . . . . . . . . . . . 67
6.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Chapter 7. Wireless Visualization Control Device 77
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.2.1 Ubiquitous Computing . . . . . . . . . . . . . . . . . . . 79
7.2.2 Augmented Reality . . . . . . . . . . . . . . . . . . . . 80
7.2.3 Context-Aware Applications . . . . . . . . . . . . . . . 81
7.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.4 Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Chapter 8. Progressive Image Composition Plugin 90
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.2.1 Progressive Transmission . . . . . . . . . . . . . . . . . 92
8.2.2 Progressive Refinement . . . . . . . . . . . . . . . . . . 93
8.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 94
8.3.1 Initial Triangle Assignment . . . . . . . . . . . . . . . . 94
8.3.2 Viewport and Resolution Determination . . . . . . . . . 95
8.3.3 Data Exchange . . . . . . . . . . . . . . . . . . . . . . . 100
8.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.4.1 Oceanographic . . . . . . . . . . . . . . . . . . . . . . . 103
8.4.2 Santa Barbara . . . . . . . . . . . . . . . . . . . . . . . 106
8.4.3 Visible Human . . . . . . . . . . . . . . . . . . . . . . . 109
8.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Chapter 9. Foveated Vision Plugin 114
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
9.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
9.2.1 Image Processing . . . . . . . . . . . . . . . . . . . . . . 116
9.2.2 Image Transmission . . . . . . . . . . . . . . . . . . . . 117
9.2.3 Image Generation . . . . . . . . . . . . . . . . . . . . . 118
9.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 119
9.3.1 Continuous Method . . . . . . . . . . . . . . . . . . . . 120
9.3.2 Discrete Method . . . . . . . . . . . . . . . . . . . . . . 122
9.3.3 Load Balancing . . . . . . . . . . . . . . . . . . . . . . . 123
9.3.4 Compositing . . . . . . . . . . . . . . . . . . . . . . . . 127
9.3.5 Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . 128
9.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
9.4.1 Visible Human . . . . . . . . . . . . . . . . . . . . . . . 130
9.4.2 Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
9.4.3 Skeleton . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
9.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Chapter 10. Conclusion and Future Work 143
10.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
10.2 Limitations of the Metabuffer . . . . . . . . . . . . . . . . . . 146
10.3 Limitations of the Applications . . . . . . . . . . . . . . . . . . 147
10.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
10.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Appendix 150
Appendix A. Simulator Classes 151
A.0.1 Class CClock . . . . . . . . . . . . . . . . . . . . . . . . 151
A.0.2 Class CComposerPipe . . . . . . . . . . . . . . . . . . . 153
A.0.3 Class CComposerQueue . . . . . . . . . . . . . . . . . . 159
A.0.4 Class CInFrameBus . . . . . . . . . . . . . . . . . . . . 161
A.0.5 Class COutFrame . . . . . . . . . . . . . . . . . . . . . 168
Appendix B. Emulator Distribution 174
B.1 Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
B.2 Building the Metabuffer Emulator . . . . . . . . . . . . . . . . 175
B.2.1 glut-3.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
B.2.2 tiff-v3.5.5 . . . . . . . . . . . . . . . . . . . . . . . . . . 178
B.2.3 ocview . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
B.2.4 emu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
B.3 Running the Metabuffer Emulator . . . . . . . . . . . . . . . . 180
Bibliography 182
Vita 190
List of Tables
2.1 Current parallel rendering systems . . . . . . . . . . . . . . . 11
3.1 Viewport control information . . . . . . . . . . . . . . . . . . 28
3.2 Case one: bandwidth analysis . . . . . . . . . . . . . . . . . . 30
3.3 Case two: bandwidth analysis . . . . . . . . . . . . . . . . . . 31
3.4 Case three: bandwidth analysis . . . . . . . . . . . . . . . . . 33
3.5 Case four: bandwidth analysis . . . . . . . . . . . . . . . . . . 34
8.1 Progressive data set information . . . . . . . . . . . . . . . . . 102
9.1 Foveated data set information . . . . . . . . . . . . . . . . . . 129
List of Figures
2.1 SHRIMP zoom out timings for horse model . . . . . . . . . . 15
3.1 Metabuffer architecture . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Case one: single screen viewport . . . . . . . . . . . . . . . . . 30
3.3 Case two: four screen viewport . . . . . . . . . . . . . . . . . 31
3.4 Case three: four screen low resolution viewport . . . . . . . . 32
3.5 Case four: nine screen low resolution viewport . . . . . . . . . 33
4.1 Simulator class instance organization . . . . . . . . . . . . . . 42
4.2 Rayshade generated input images with viewport configuration 43
4.3 Composited simulator output images . . . . . . . . . . . . . . 44
4.4 Zoomed image without (left) and with (right) antialiasing . . . 45
4.5 Screen door transparency Metabuffer output . . . . . . . . . . 50
4.6 Zoom of transparency example . . . . . . . . . . . . . . . . . . 51
5.1 Emulator class instance organization . . . . . . . . . . . . . . 56
6.1 Viewport configuration for horse example. . . . . . . . . . . . 73
6.2 Greedy algorithm timings for various model sizes . . . . . . . 75
7.1 Wireless visualization device user interface . . . . . . . . . . . 83
7.2 Wireless visualization operation . . . . . . . . . . . . . . . . . 85
8.1 Asymmetrical frustum illustration . . . . . . . . . . . . . . . . 99
8.2 Sample frames from the oceanographic movie . . . . . . . . . 104
8.3 Rendering times for oceanographic movie frames . . . . . . . . 105
8.4 Sample frames from the Santa Barbara movie . . . . . . . . . 107
8.5 Rendering times of Santa Barbara movie frames . . . . . . . . 108
8.6 Sample frames from the visible human movie . . . . . . . . . . 110
8.7 Rendering times for visible human movie frames . . . . . . . . 111
8.8 Composited visible human in visualization lab . . . . . . . . . 111
9.1 Coren’s acuity graph . . . . . . . . . . . . . . . . . . . . . . . 119
9.2 Foveated pyramid for visible human example . . . . . . . . . . 125
9.3 Sample frames from the visible human movie . . . . . . . . . . 132
9.4 Rendering times for visible human movie frames . . . . . . . . 133
9.5 Sample frames from the engine movie . . . . . . . . . . . . . . 135
9.6 Rendering times for engine movie frames . . . . . . . . . . . . 136
9.7 Sample frames from the skeleton movie . . . . . . . . . . . . . 138
9.8 Rendering times for skeleton movie frames . . . . . . . . . . . 139
Chapter 1
Introduction
1.1 Motivation
In most computer graphics applications, resolution is a tradeoff in terms
of frame rate. Using low resolution images provides a low quality display, but
typically allows higher frame rates because less data needs to be computed.
High resolution images, on the other hand, give better display quality, yet
are hindered by slower refresh times and thus limit user interactivity. Low im-
age quality and low user interactivity are both detriments to computer graphics
visualization applications. The question, then, is what can be done to minimize
this impact.
1.2 Background
Probably the most well known example of this tradeoff is the popular
computer game, Quake [25]. The Quake user faces three choices. One, he or
she can run the game in the highest resolution the computer can currently
support yielding a beautiful visual experience. Doing so, however, will likely
drop the frame rate of the game, and thus limit how well the Quake user can
interact with the environment–essentially the other Quake participants playing
concurrently in online Quake death matches (games where opponents do battle
against each other in a computer generated simulation). The reduced user
interactivity will cause the Quake user to become easy prey for murderous co-
players. Two, the Quake user can decide to use the lowest resolution possible.
The display is terrible, but the frame rate is quick and the player’s responses
are as well. The Quake user is now competitive with the rest of the players in
the death match. Three, the user can opt to upgrade his or her system to a
faster processor and video card by spending hundreds or thousands of dollars.
This will result in great graphics and quick response, though perhaps a much
lighter wallet. The choices most Quake players make are obvious. Those with
trust funds choose three. Those on work-study grants choose two.
In the field of scientific visualization, money concerns are, to an extent,
less important than results. If it were possible to improve a visualization ap-
plication by merely spending more money on a faster processor or a better
performing rendering board, it would likely be done. High priced SGI com-
puting platforms, for instance, sell in low, but profitable, quantities. In most
cases, the money spent on hardware is more than offset by the time saved and
capabilities garnered.
However, today imaging and simulations are increasingly yielding larger
and larger data streams. These data sets can range in size from gigabytes to
terabytes of information. Such data sets are much too large to store and
render on a single machine–even a pricey SGI. Viewing these large data sets
poses yet another problem. In some cases the detail allowed by a single high
performance monitor may not be adequate for the resolution required. To
cope with these issues, many systems have been designed which use parallel
computation and tiled screen displays. Dividing the data set among a number
of computers reduces its enormous bulk to more reasonably sized chunks that
can be quickly rendered. Likewise, using tiled displays results in a larger
amount of display space. Small details that might be culled out on a single
monitor can be spotted in an immersive visualization laboratory with hundreds
of square feet of screen space.
These current parallel, multidisplay systems share common problems,
however. Because they all depend on data locality in some form (di-
viding the data set evenly among the processors), changing the viewpoint of
the user can often wreck any careful load balancing done on the data set.
An unevenly load balanced data set will significantly degrade the frame rate
which a user experiences. Even worse, in some cases if the tiled displays are
linked only to certain machines, large quantities of data or pixels may need to
be moved immediately simply to render the frame correctly. This can result
in a significant delay to the user. Also, large tiled displays require immense
amounts of computing resources to render. This is despite the fact that, in
most cases, much of the display is either not in the user’s view or is only within
the user’s peripheral vision. Current parallel, multidisplay systems are limited
in how they can allocate their computing resources to cope with a partially
viewed scene in order to accelerate the possible frame rate.
The thesis of my research is that multiresolution techniques can elim-
inate data locality and resource allocation problems in parallel multidisplay
systems that render interactive large scale data streams by providing an es-
sential balance between display quality and frame rate.
1.3 Contributions
The primary contributions of this dissertation are:
1. The architecture for a parallel multidisplay multiresolution im-
age compositing system: This architecture, called the Metabuffer,
is flexible enough that the number of rendering servers can scale in-
dependently from the number of display tiles. In addition, since the
Metabuffer allows the viewports to be located anywhere within the to-
tal display space and overlap each other, it is possible to achieve a much
higher degree of load balancing. Since the viewports can vary in size, the
system supports multiresolution rendering, for instance allowing a single
machine to render a background at low resolution while other machines
render foreground objects at much higher resolution. The architecture
also supports antialiasing and transparency.
2. The Metabuffer hardware simulator written in C++: To test
the architecture of the Metabuffer, a simulator was written to mimic the
hardware in C++. The major components of the Metabuffer architecture
were coded as classes. By creating or deleting instances of the classes,
it is possible to easily test large or small Metabuffer configurations. The
simulator proves that the architecture can perform parallel, multidisplay,
multiresolution image compositing without glitches.
3. The Metabuffer emulator running on a Beowulf cluster using
MPI and GLUT: In order to test applications developed for the Meta-
buffer, an emulator was written in software that mimics the operation
of the hardware but is coded to perform as efficiently as possible on
the Beowulf cluster. While sort last systems running completely in soft-
ware are possible [39], because the approach of the Metabuffer hardware
depends on heavily parallel I/O and pipelined compositing, the limited
I/O and single processors of the individual cluster machines are not ide-
ally suited to emulating it. The large communication requirements of
so much pixel data make it difficult to map the Metabuffer architec-
ture to a standard cluster with machines that have only a single limited
bandwidth system bus. In addition, adding large numbers of machines
to a cluster to achieve pipelined computation streams causes the com-
putation granularity to be too fine relative to communication overhead.
This greatly reduces efficiency. It is for these reasons that sort last sys-
tems such as the Metabuffer usually require hardware implementations
rather than running in software. However, a workable, though not scal-
able, implementation of the Metabuffer has been created in software with
coarse parallel granularity using the MPI library to pass Metabuffer I/O
over the Beowulf cluster’s network connections and the GLUT library (a
cross-platform GUI layer for OpenGL [46] applications) to render and
display image data. A plugin API is used with this emulator testbed
to write applications which interface to the Metabuffer using only a few
standard calls; a sketch of what such an interface might look like appears
after this list.
4. A greedy algorithm for creating Metabuffer viewports to cover
the data set in order to render all polygons: In order to quickly
divide data sets into even chunks for the rendering servers to process, a
greedy algorithm was developed that uses a simple heuristic to partition
the polygons in a quick and hopefully load balanced manner.
5. Wireless visualization control device: Using Pocket PC devices
equipped with wireless Ethernet, a Windows CE client application was
written in conjunction with a Linux server to allow multiple users to re-
motely control the operations of the Metabuffer emulator plugins. Sim-
ply tapping the display of the Pocket PC device controls the orientation
of objects being viewed. The control device is also currently being used
to position the lines of sight of users for the foveated vision plugin until a
wireless gaze tracking headset is available. In the future the device may
feature region of interest (ROI) tracking in which user history, current
viewpoint, and object features are all taken into account. Collaborative
user interface ideas could also be explored when multiple devices interact
with the same display.
6. Progressive image compositing using the multiresolution capa-
bilities of the Metabuffer: A Metabuffer emulator plugin was writ-
ten to test the possibilities of using multiresolution for progressive image
compositing. If the user happens to change views of a scene, and poly-
gons local to a rendering server no longer fit within a high resolution
viewport, that viewport can enlarge and become low resolution, rather
than necessitating the shifting of polygons to other rendering servers. In
this way the user’s frame rate remains constant. When the user stops at
a scene to study it further, the polygons can be redistributed in order to
form high resolution viewports once again. This technique is analogous
to progressive refinement in the case of World Wide Web images. The
user can navigate quickly through web pages containing low resolution
images. When he or she finally arrives at the correct page, only then are
high resolution images downloaded.
7. Foveated vision using the multiresolution capabilities of the
Metabuffer: A Metabuffer emulator plugin was written to test the
possibilities of using multiresolution for foveated vision applications. The
human eye cannot discern high levels of detail in its peripheral vision.
This can be exploited by rendering the periphery using lower polygon
counts and lower resolution. Large areas of screen space can be rendered
by only a few rendering servers. Meanwhile, the majority of rendering
machines concentrate their work only where the user is actually looking.
This makes efficient use of rendering resources, especially in cases where
the display space is quite large and thus improves the user’s frame rate.
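As a concrete illustration of the plugin interface mentioned in contribution 3
above, the following is a minimal sketch of what such an interface might look
like. All names here (MetabufferPlugin, chooseViewport, renderFrame) are
hypothetical, invented for this example; the emulator's actual plugin API is
described in Chapter 5 and its distribution in Appendix B.

    // Hypothetical sketch of a Metabuffer emulator plugin interface.
    // These names are invented for illustration; the real API is
    // described in Chapter 5 and Appendix B.
    struct Viewport {
        int x, y;           // origin in the global display space (pixels)
        int width, height;  // extent; may span several display tiles
        int multiple;       // pixel replication factor (1 = full resolution)
    };

    class MetabufferPlugin {
    public:
        virtual ~MetabufferPlugin() {}
        // Decide where this machine's viewport should sit for the
        // coming frame and at what resolution.
        virtual Viewport chooseViewport() = 0;
        // Render this machine's share of the polygons into the local
        // frame buffer (via OpenGL/GLUT) for the chosen viewport.
        virtual void renderFrame(const Viewport& vp) = 0;
    };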
A chapter in this dissertation is devoted to each of these contributions.
Chapter 10 summarizes some of the limitations of this research and proposes
avenues for future work.
Chapter 2
Background and Related Work
2.1 Introduction
Today imaging and simulation applications are increasingly yielding
larger and larger data streams. Visualizing these large data streams inter-
actively may be difficult or impossible with a single computer. Because of
this, many research groups have studied the problem of visualizing data sets
in parallel. Schneider analyzes the suitability of PCs for parallel rendering of
single and multiple frames on symmetric multiprocessors and clusters [45]. In
general, most of these parallel rendering systems, with the notable exceptions
of hybrid systems such as Pomegranate [11], can be classified into three dif-
ferent categories depending on where the data is sorted from object-space to
image-space as shown by Molnar [36]. Crockett [10] describes various consid-
erations in building parallel systems and the tradeoffs associated with these
three categories.
Even with powerful parallel systems to render the data, in some cases
single high performance monitors may not have adequate resolution to resolve
the detail of large data sets. The use of multiple displays in tiled configura-
tions is an accepted way to gain very high resolution displays. Using separate
displays to display a single image, of course, has a few problems. Issues with
aligning the images of the multiple displays have been studied by both Chen [8]
and Raskar [42]. Once the images are aligned, color variations between the dis-
plays and even across the displays themselves have to be corrected. Majumder
[33] deals with the color uniformity question.
This chapter describes some of the recent systems created by others in
the parallel rendering arena and shows where the work with the Metabuffer
fits in this group. The systems are divided according to Molnar’s three sorting
categories and further subdivided by whether they work with single or multiple
displays. Section 2.2 discusses sort-first parallel rendering systems and their
tradeoffs. Section 2.3 talks about the sort-middle technique (rarely used for
cluster configurations). Section 2.4 lists the sort-last rendering systems (the
category to which the Metabuffer belongs). Each category has its benefits
and its drawbacks, and these issues are discussed in each section. Finally
section 2.5 describes the reasoning for choosing the sort last method for the
Metabuffer and why this method lends itself better to multiresolution support
than the others. Figure 2.1 is an overview of this chapter and shows each
parallel rendering system and its feature set properly classified.
System        Developer    Class       Display    Architecture
Pomegranate   Stanford     Hybrid      Single     Custom rendering hardware
WireGL        Stanford     Sort First  Multiple   Computing cluster
SHRIMP        Princeton    Sort First  Multiple   Computing cluster
PixelFlow     UNC          Sort Last   Single     Custom rendering hardware
Sepia         CalTech      Sort Last   Single     ServerNet II w/FPGA boards
Lightning-2   Intel        Sort Last   Multiple   Custom compositing hardware
Metabuffer    UT Austin    Sort Last   Multiple   Custom compositing hardware
Table 2.1: Current parallel rendering systems
2.2 Sort First
In the sort-first approach, the display space is broken into a number
of non-overlapping display regions which can vary in size and shape. Be-
cause polygons are assigned to the rendering process before geometric process-
ing, sort-first methods may suffer from load imbalance in both the geometric
processing and rasterization if polygons are not evenly distributed across the
screen partitions.
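To make the sorting step concrete, the sketch below buckets triangles into
non-overlapping screen tiles by their screen-space bounding boxes. It is a
generic illustration of the sort-first idea, not code from any of the systems
surveyed in this chapter; note how a triangle spanning a tile boundary lands
in several buckets and must be rendered once per bucket, which is exactly the
overlap penalty discussed with the systems below.

    // Generic sort-first bucketing sketch (illustrative only).
    // The display is split into non-overlapping tiles; each triangle
    // is assigned to every tile its screen-space bounding box touches,
    // so a triangle spanning a boundary is rendered more than once.
    #include <algorithm>
    #include <vector>

    struct Tri {
        float minX, minY, maxX, maxY;  // screen-space bounding box
    };

    std::vector< std::vector<Tri> > bucketTriangles(
            const std::vector<Tri>& tris,
            int tilesX, int tilesY, int tileW, int tileH) {
        std::vector< std::vector<Tri> > buckets(tilesX * tilesY);
        for (int i = 0; i < static_cast<int>(tris.size()); ++i) {
            const Tri& t = tris[i];
            int x0 = std::max(0, static_cast<int>(t.minX) / tileW);
            int y0 = std::max(0, static_cast<int>(t.minY) / tileH);
            int x1 = std::min(tilesX - 1, static_cast<int>(t.maxX) / tileW);
            int y1 = std::min(tilesY - 1, static_cast<int>(t.maxY) / tileH);
            for (int ty = y0; ty <= y1; ++ty)
                for (int tx = x0; tx <= x1; ++tx)
                    buckets[ty * tilesX + tx].push_back(t);  // duplicated if spanning
        }
        return buckets;
    }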
2.2.1 Recent Multidisplay Systems
WireGL
The WireGL software suite [24] takes an innovative approach to parallel
rendering. Essentially, it is transparent to the hosting application. WireGL
replaces the standard OpenGL dynamic link library used with Microsoft’s
operating systems. Instead of processing OpenGL commands and sending
the results to a local display as the standard OpenGL library would do, the
WireGL library sorts the OpenGL commands depending on screen location and
then transmits these commands over a high speed network to remote servers.
The servers then perform the actual rendering and show the results on their
own local display. This can effectively allow for a large multitiled display
without any modifications to the hosting application. In fact, a favorite test
application of the WireGL team is the computer game Quake, mentioned at
the start of this dissertation, which is reported to have playable interactive
frame rates when running under WireGL on a large tiled display.
Care must be taken to parse the OpenGL command stream properly.
OpenGL works like a state machine, so splitting the command stream among
several servers must ensure that commands are correctly placed to keep all the
machines in the proper mode. WireGL does this by duplicating some com-
mands, offsetting this by, interestingly enough, culling needless repetition in
the OpenGL stream. Apparently C++ programs are notorious for reinitializ-
ing OpenGL state even when not really necessary.
There are a few drawbacks to using this approach, however. Polygons
must be distributed from a central server to multiple outlying renderers. This
by itself limits the scalability and hence the usefulness of the system for ren-
dering large data sets. Like all sort first systems, WireGL suffers from load
imbalance due to nonhomogeneous polygon distribution. Also many polygons
will need to be rendered multiple times if they fall on the edges of the display
tiles. Still, WireGL is a very attractive system for transparently obtaining
large tiled displays for moderate polygon count applications.
SHRIMP
The Princeton University SHRIMP (Scalable High-performance Really
Inexpensive Multi-Processor) project [44] uses the sort-first approach to bal-
ance the load of multiple PC graphical workstations. The screen space is
partitioned into blocks that are assigned to different servers. These blocks do
not overlap–they abut. Each rendering server is responsible for the polygons
that fall within the blocks that are assigned to it. If some polygons happen to
fall into multiple blocks owned by different servers, those polygons will need
to be rendered multiple times–once by each server. The SHRIMP project at-
tempts to control communication bandwidth by assigning the blocks to the
same server that is running the display where that block resides. Otherwise,
pixels must be communicated to the correct display server from the rendering
server.
The SHRIMP project suffers from several overhead disadvantages which
are a result of its sort-first architecture. The first is the requirement of non-
overlapping blocks which necessitates rendering the polygons that do overlap
multiple times. Using smaller blocks gives better load balancing, but also
introduces severe overlap penalties. The second is the need to transmit pixels
from rendering servers to the correct display if those blocks are not already
local to the display. The current SHRIMP cluster runs with m rendering
servers on n displays, where m = n. Scaling m >> n would result in this pixel
transfer time growing enormously. Third and finally, and most troublesome
for frame rate considerations, changing user viewpoints can severely upset the
block assignment load balancing. Currently, blocks are assigned to processors
using one of three different load balancing algorithms: grid bucket assignment,
grid bucket union, and kd-split. However, all three share the same problem.
When the user moves or zooms around the scene, polygons move to different
blocks resulting in load imbalance penalties. Transmitting polygons to even
the load results in even more time used. For example, a zoomed in scene
could be evenly divided among all the rendering servers. Zooming out might
concentrate all the polygons into a single block, necessitating that they be
reorganized.
Figure 2.1: SHRIMP zoom out timings for horse model
Figure 2.1 shows the results from a SHRIMP project paper during a
zoom in operation on a horse mesh model. Because the experiment is a simple
zoom operation, polygons never have to be transmitted from one machine to
another. A polygon assigned to a certain region will always remain in that
region. The only difference is that the region grows in size. This fact spares
the example from the load imbalance and polygon transmission time penalties.
However, polygon overlap and pixel transmission still cause problems for the
SHRIMP architecture.
Even without polygon transmission penalties, from the graph it is easy
to see that user frame rates vary greatly during the operation. At the first
frame, the horse is zoomed out–probably lying in a single display on the tiled
display space. Regions of the horse are rendered by different machines in the
cluster, but pixels from these regions need to be transferred to the machine
that owns that single display. The pixel transfer overhead is clearly evident in
the graph. At the final frame, the horse has been zoomed in until it fills the
entire tiled display. Here, the polygons are much more uniformly distributed
over all the displays. Machines rendering regions of the horse most likely only
need to send their pixels to the local display.
This dissertation will demonstrate how multiresolution techniques, specif-
ically progressive image composition on the Metabuffer, effectively solve the
frame rate variation due to these problems that are evident in the SHRIMP
project, a current state of the art sort first parallel multidisplay rendering
system.
2.3 Sort Middle
In the sort-middle case, the polygon assignment is done in the middle
of the rendering pipeline–after the polygons have been processed to determine
their display coordinates and before they have been rasterized. The main
disadvantage of this technique is that almost all of the polygons need to be
retransmitted between the two steps. This amount of communication makes it
unattractive for loosely coupled parallel rendering systems involving clusters of
stand alone machines. However, this is the most common method for dedicated
hardware rendering systems. It is simple, and because these closely knit pieces
of hardware can redistribute the polygons rapidly, it is fast for low numbers
of processing units.
Because this dissertation deals with rendering extremely large data sets
on large, loosely coupled clusters, sort middle will not be discussed further in
this report.
2.4 Sort Last
The sort-last approach is also known as image composition. Each ren-
dering process performs both geometric processing and rasterization indepen-
dent of all other machines in the system. Local images rendered on the render-
ing processes are composited together to form the final image. The sort-last
method makes the load balancing problem easier since screen space constraints
are removed. However, compositing hardware is needed to combine the output
of the various processors into a single correct picture.
Such approaches have been used since the 60’s in single-display systems
[6, 17, 37, 38, 49, 50]. More recent work includes the PixelFlow [12], Sepia
[21], and AIST [40] systems. Multiple display systems, which are the focus of
this dissertation, include Lightning-2 [20] and the Metabuffer [4].
2.4.1 Recent Single Display Systems
PixelFlow
The PixelFlow [12] system developed at the University of North Car-
olina is a completely custom piece of hardware. Even the rendering engines are
custom and part of the architecture. This differs from the Sepia, Lightning-2,
and Metabuffer projects which use COTS (Commercial Off The Shelf) graphics
cards in order to render the polygons.
Essentially the PixelFlow architecture chains together rendering boards,
followed by shader boards, followed by a frame buffer board on a high speed
backplane. A parallel host computer provides graphics primitive and shading
information to each board. The boards then take this information and render
the display in 128 by 128 pixel chunks. This is done with the assistance of
a 128 by 128 SIMD processor array located on each rendering board. The
rendering boards also have other coprocessors to do geometry processing and
polygon sorting. The chunks are composited as they go down the backplane,
and then lighting and shading is performed by the shader boards until finally
the finished image is stored in the output frame buffer.
The PixelFlow system is a very powerful architecture. However, its
all-custom design might be a problem with the rapid pace of technology. Al-
though integrating the rendering engines into the architecture certainly pro-
vides a speed advantage, with the swift improvements in COTS graphics cards
this could be considered a drawback. Compositing systems such as Sepia,
Lightning-2, and the Metabuffer, which deal only with pixel output from COTS
cards, can adapt easily to newer and better COTS graphics card designs. They
only need to deal with video pixel transmission resolution standards, which
change much more slowly than COTS rendering performance. Provided the
new video card drivers support some manner of Z buffer value extraction,
simply replace the older cards with the latest and greatest. No change in cus-
tom hardware is required. Also, the PixelFlow system was not designed with
multiple displays in mind.
Sepia
One of the more recent cluster based sort-last image compositing sys-
tems is the Sepia project [21]. In a completely opposite tack to the Pix-
elFlow system, the Sepia, except for programmed FPGA chips, relies entirely
on COTS equipment and shuns custom chips and circuit boards. Sepia uses
multiple Compaq Pamette FPGA prototyping boards in conjunction with a
Beowulf cluster and a Compaq ServerNet II network. The Pamette commu-
nicates with the Beowulf cluster and the ServerNet II network using standard
PCI bus interfaces. This setup greatly leverages existing COTS technology.
The Pamette prototyping boards are configured to be pixel merge en-
gines. Pixel merge engines take input from their host PC and composite it (or
perform other mathematical operations) with data arriving from the Server-
Net II network. The output of this operation is then sent over the ServerNet
II network to another pixel merge engine on a different computer to form a
computational pipeline. When the data is finally ready to be viewed, it is sent
to a pixel merge engine which relays it to a frame buffer on its host computer
for display.
The Sepia system is intriguing because of its use of standard compo-
nents. Programmed FPGAs are really the only custom hardware needed. This
means that a system can be developed rapidly and for a relatively low cost
compared to custom hardware design. The main disadvantage of the Sepia
system is that it requires image data to be sent to and from host PCs over
the system’s PCI bus. This bus is likely to be already overloaded with data
from the rendering application and is limited by bandwidth. Also, the Sepia
system provides no way to send data from a single rendering server to multiple
pipelines. This limits its possibilities for multidisplay use. Currently the Sepia
team is exploring options to utilize the DVI (Digital Visual Interface) port on
commodity graphics cards to ship digital image data directly off the card and
avoid the PCI bus, similar to what the Metabuffer and Lightning-2 designs
employ.
2.4.2 Recent Multidisplay Systems
Lightning-2
The Lightning-2 system [20] developed by Intel and Stanford is another
recent cluster based entry into the parallel multidisplay rendering arena. It
appeared at the same time as the Metabuffer project and shares many basic ar-
chitectural features. Like the Metabuffer, it uses a bus and pipeline crossbar in
order to communicate image data and composite it to form a final display. At
each bus/pipeline connection is a large FPGA which is programmed to choose
pixels from the bus and composite them with data arriving on the pipeline.
Also like the Metabuffer, it employs the DVI port on recently made graphics
cards in order to offload pixel data from the rendering machines without load-
ing down the PCI bus or its system bus. However, unlike the Metabuffer, the
Lightning-2 method used to perform compositing does not allow multiresolu-
tion. The Lightning-2 also does not provide antialiasing support.
Metabuffer
The Metabuffer [4] hardware supports a scalable number of PCs and an
independently scalable number of displays–there is no a priori correspondence
between the number of renderers and the number of displays to be used. It also
allows any renderer to be responsible for any axis-aligned rectangular viewport
within the global display space at each frame. Such viewports can be modified
on a frame-by-frame basis, can overlap the boundaries of display tiles and
each other arbitrarily, and can vary in size up to the size of the global display
space. Thus each machine in the network is given equal access to all parts of
the display space, and the overall screen is treated as a uniform display space,
that is, as though it were driven via a single, large frame buffer, hence the
name Metabuffer.
Because the viewports can vary in size, the system supports multi-
resolution rendering, for instance allowing a single machine to render a back-
ground at low resolution while other machines render foreground objects at
much higher resolution. Also, because the Metabuffer supports supersampling,
antialiasing is possible as well as transparency using the screen door method.
2.5 Discussion
It was decided to design the Metabuffer as a sort last system because
of the inherent flexibility the method allows for load balancing. For example,
because they are sort-last systems, none of the Sepia, Lightning-2, or Meta-
buffer devices incur any of the polygon overlap penalties evident with the
SHRIMP project. Regions may overlap each other, so there is no reason to
render a polygon twice, provided the polygon is not zoomed in to be so large
as to completely exceed the bounds of a viewport. Also, there is no pixel
transmission overhead associated with the Lightning-2 and Metabuffer sort
last systems. The architectures are designed to efficiently shuttle pixels from
renderer to any display in the global display space. Compare this to SHRIMP,
where pixel transmission penalties occur whenever the local display is not used.
SHRIMP, Sepia, and Lightning-2 all do share two common problems,
though. The first is changing user viewpoints. As discussed before with
SHRIMP, changing the user’s viewpoint, either by rotating the data set, zoom-
ing it, or looking at a different area, will likely cause polygons to fall into and
out of the screen regions that the rendering machines have been assigned. In
the best case, this will simply cause a load imbalance resulting in an inefficient
use of the rendering resources. In the worst case, the machine may not be able
to cover all of the polygons it is assigned and certain polygons may not be
able to be rendered at all unless they can be transmitted to another machine
immediately. This double edged sword results in time penalties both for load
imbalance and for transmission over the network to move polygons from one
rendering machine to another.
The second problem all share is limited resource allocation flexibility.
Just like SHRIMP, if the devices are driving a very, very large display, ren-
dering that display is an all or nothing event. The entire display is rendered
in high resolution. Typically the user (or users) looking at the display may
only be studying a certain small area. The unviewed regions are wasted. Good
examples of this are CAVE [7] type virtual reality configurations. Only a small
part of the cave is viewed at any one time. Ideally, the majority of rendering
resources should be concentrated only where the users are looking. This will
improve the frame rate of the application and thus increase user responsive-
ness.
The Metabuffer attempts to solve these two issues by including mul-
tiresolution support. This allows for the progressive image composition and
foveated vision techniques that are discussed later in this dissertation. The
Metabuffer also has several other unique features not duplicated in the simi-
lar Lightning-2 architecture, namely antialiasing and transparency using the
screen door method in conjunction with pixel replication. These will be dis-
cussed in the architecture section.
Chapter 3
Metabuffer Architecture
3.1 Metabuffer Architecture
The architecture of the Metabuffer presents a number of challenges.
The most difficult problem is the large amount of data that must be processed.
Each pixel needs RGB color, Z order, and alpha information. A single frame
will have millions of pixels. A real-time rendered animation should display
approximately 30 frames per second in order to be fluid and smooth. Multiply
all of this by several rendering engines and several output displays and the
large quantities of data involved are clearly evident.
Figure 3.1 shows how a Metabuffer architecture using three rendering
engines and four output displays utilizes multiple pipelined data paths and
busses to surmount this problem. External to the board, COTS (Commer-
cial Off-the-Shelf) rendering engines (A) deliver their data to on-board frame
buffers (B) by means of the recently adopted industry standards for digital
video transmission, the Digital Visual Interface (DVI). Since COTS rendering
engines (A), at this time, transfer only 24 bits per pixel over these digital links,
color is transferred on even frames, while alpha and Z information is trans-
Figure 3.1: Metabuffer architecture (PC workstations with rendering engines
(A) feed on-board frame buffers (B), which drive a grid of compositing units
(C) connected to the displays)
ferred on odd frames. At a refresh rate of 60 hertz, this is still fast enough
to provide enough RGB, alpha and Z information for 30 frames per second.
The on-board frame buffer (B) stores information from both transmissions in
memory. Control information, such as the location of the viewports and their
final destination in the overall display, is stored on the first scan line of each
rendering engine’s image (A). This first scan line is never displayed. Instead,
DSP code, viewport data, or anything else that is needed by the control logic
of the frame buffer can be written here using standard OpenGL glDrawPixels()
calls.
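Since the DVI link carries only 24 bits per pixel, each renderer must split
its RGBAZ pixels across the two transmissions described above. The sketch
below shows one possible packing, assuming an 8-bit alpha and a 16-bit Z
value; the actual bit allocation used by the Metabuffer is not specified here.

    // Sketch of splitting RGBAZ data across two 24-bit DVI frames:
    // color on even frames, alpha plus Z on odd frames. The 8-bit
    // alpha / 16-bit Z split is an assumed allocation for illustration.
    struct Rgb24 { unsigned char c0, c1, c2; };  // one 24-bit DVI pixel

    Rgb24 packEvenFrame(unsigned char r, unsigned char g, unsigned char b) {
        Rgb24 p = { r, g, b };  // even frame carries the color channels
        return p;
    }

    Rgb24 packOddFrame(unsigned char a, unsigned short z) {
        // odd frame carries alpha in the first byte, depth in the rest
        Rgb24 p = { a,
                    static_cast<unsigned char>(z >> 8),
                    static_cast<unsigned char>(z & 0xFF) };
        return p;
    }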
When a full frame has been buffered, data is selectively sent over a
wide bus to the composer units (C) based on viewport locations. The com-
posers (C) take only the data that is required to build their column’s output
image and ignore the rest. Each composer (C) then sends its data in pipeline
fashion down the column to the next lower composer (C) so that the pixel Z
order information can be compared with those Z values from the other COTS
renderers (A). This way, only the front-most pixel is saved. The collaged data
is then stored on another on-board frame buffer. These smart frame buffers
can perform post processing on the data for anti-aliasing and are also able to
drive the off-board displays again using the DVI specification.
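The core of each compositing step is a per-pixel depth comparison. The
sketch below is a software illustration of the operation a composer applies
as pixels stream down its column; the actual operation is performed by the
compositing hardware, and the pixel layout shown is assumed for clarity.

    // Illustrative sketch of the depth comparison each composer applies:
    // keep the front-most of the pixel arriving from the pipeline above
    // and the local pixel selected from the bus for this screen position.
    struct Pixel {
        unsigned char r, g, b, a;  // color and alpha from the renderer
        float z;                   // depth value; smaller is closer
    };

    inline Pixel composite(const Pixel& fromPipeline, const Pixel& fromBus) {
        return (fromBus.z < fromPipeline.z) ? fromBus : fromPipeline;
    }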
3.2 Bus Dataflow
Encoded at the start of each rendering engine’s image is control infor-
mation that tells the input frame buffer which segments of the image should
be sent to which composers and where they should be placed in the final dis-
play. This work is done by the computer hosting the rendering engine since it
offloads the computational work to a full fledged CPU, which is more suited
to this task than the streamlined Metabuffer. The control information is sent
in tabular form, with one row corresponding to each image segment.
Dcomp   Sx   Sy   Sdx   Sdy   Dx   Dy   Dmultiple   Transparent
1       0    0    75    75    25   25   1           100
2       75   0    25    75    0    25   1           100
3       0    75   75    25    25   0    1           100
4       75   75   25    25    0    0    1           100
Table 3.1: Viewport control information
Table 3.1 shows some typical data describing a viewport configuration
(essentially the layout as described in section 3.2.1 later in this paper). Here,
the image and display size are assumed to be 100 pixels by 100 pixels. Dcomp
is the index number of the composer (or display) where the segment is to be
sent. Sx and Sy refer to the source coordinates of the segment in the rendered
image. Sdx and Sdy refer to the dimensions of the segment in the source
image. Dx and Dy refer to the destination coordinates in the display image.
Dmultiple is the replication factor of the source pixel. Since the ratio of source
to destination pixels is 1:1, this multiple is 1. Transparent refers to the special
28
patterns that are applied to pixel replication operations in order to provide
for screen door transparency. 100 means that the viewports are opaque. The
input frame buffer broadcasts the entire viewport table over the bus to the
composers at the start of each frame. Each composer then takes the entry
that it is responsible for and stores it locally.
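Because the control table rides in the first scan line of each rendered image,
the host PC can emit it with the glDrawPixels() call mentioned in section 3.1.
The sketch below packs the table into raw bytes and writes it to row zero;
the byte encoding shown (nine integer fields per entry, padded into RGB
pixels) is an assumption for illustration, since the actual control format is
internal to the Metabuffer frame buffer logic.

    // Sketch: writing the viewport control table into the first scan
    // line with glDrawPixels(). The byte encoding is assumed for
    // illustration, not the Metabuffer's actual control format.
    #include <GL/gl.h>
    #include <cstddef>
    #include <cstring>
    #include <vector>

    struct ViewportEntry {
        int dcomp, sx, sy, sdx, sdy, dx, dy, dmultiple, transparent;
    };

    void writeControlScanLine(const std::vector<ViewportEntry>& table) {
        if (table.empty()) return;
        std::vector<unsigned char> row(table.size() * sizeof(ViewportEntry));
        std::memcpy(&row[0], &table[0], row.size());
        // Pad to whole RGB pixels: three control bytes per pixel.
        GLsizei pixels = static_cast<GLsizei>((row.size() + 2) / 3);
        row.resize(static_cast<std::size_t>(pixels) * 3, 0);
        // Assumes a projection mapping window coordinates 1:1, so that
        // raster position (0, 0) addresses the first scan line.
        glRasterPos2i(0, 0);
        glDrawPixels(pixels, 1, GL_RGB, GL_UNSIGNED_BYTE, &row[0]);
    }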
3.2.1 Analysis of Bus Data Flow
One of the most interesting problems of this project is how to efficiently
transmit image data from the input frame buffers, through the bus, and then
to each composer. Since the composers are arranged in a pipeline fashion,
it is imperative that they have the data they need at the right time. If one
composer is missing its data, a glitch in the image will occur.
Since the Metabuffer employs viewports of varying size and position, it
is important to demonstrate that the bandwidth requirements of the composers
will not exceed the limited data rate of the bus that connects them to the input
frame buffers. If the bandwidth requirements are exceeded in certain viewport
configurations, glitches in the output image are certain to occur. The analysis
that follows proves that the Metabuffer has a constant bandwidth requirement
regardless of the size or orientation of the viewports that are used.
In order to analyze the worst case data flow of the board, a scheme
is used similar to the one presented in the paper by Kettler, Lehoczky, and
Strosnider [27]. Since all data needs are periodic (because of the raster display),
each task (display) can be described in terms of the amount of data needed
(C), its period (T), and its deadline (D). By quantifying these values for some
sample cases, it is easy to see that the bandwidth requirements do not change
as the viewport geometry becomes more complex.
For example, if we assume that the smallest viewport is the size of an
output screen (of w by w pixels), and that the viewports increase in size in
even multiples, observations for the following cases hold true.
Case One
Figure 3.2: Case one: single screen viewport
In figure 3.2 the input image is the same size as an output screen, but
only one composer is used. The ratio of pixels from input to output is 1:1, so
the composer requires a steady stream of data. As shown on the right, the
total bandwidth required is one screen full.
Data      Period    Deadline
C1 = w    T1 = w    D1 = w
Table 3.2: Case one: bandwidth analysis
This is the trivial case. Table 3.2 demonstrates that the data needed
(C) is equal to the period for the scheduling. A steady stream of data will
satisfy this.
Case Two
Figure 3.3: Case two: four screen viewport
Again, the input image in figure 3.3 is the same size as an output screen.
However, in this case four different composers require data. But, according to
the geometry of the display, only one composer will need data at any particular
time. As shown on the right, none of the composer viewport areas overlap.
They join together to form exactly one screen size. So, one screen size of data
is needed. The ratio of pixels from input to output is 1:1, and there is no
overlap, meaning only one pixel need be accessed on the bus at any one time.
Data            Period      Deadline
C1a = l         T1a = w     D1a = w
C1b = w − l     T1b = w     D1b = w
Table 3.3: Case two: bandwidth analysis
The variable l in table 3.3 represents the vertical dividing line in the
row between tasks 1a and 1b. For the purposes of scheduling, the horizontal
divider is ignored, since this merely changes the display destination of the
data, and not the data timing needs of the system. Adding all of the data
values together (C) results in the same quantity as the period, which means
the bandwidth is constant compared to the previous case.
Case Three
Figure 3.4: Case three: four screen low resolution viewport
In figure 3.4, the input image is four times as large in order to form
a low resolution background display. In this case four composers will require
data, but they will all require data at the same time. As shown on the right,
four screen-fulls of data are required. However, the ratio of input pixels to
output pixels is 1:4. Thus, while four times as many screens are being created,
each is furnished with one fourth of the data, so the bandwidth requirement is
still constant. The fact that four composers require pixel data at the same
time remains a complication, but since the total bandwidth requirement does not
grow, a simple buffering scheme can satisfy each of the composers.
Table 3.4 displays the results of this operation. Because pixels are being
replicated to twice their size, the period (T) of the scheduling increases by a
factor of two because there are half as many rows to process. Likewise, the
Data         Period       Deadline
C1 = w/2     T1 = 2w      D1 = 2w
C2 = w/2     T2 = 2w      D2 = 2w
C3 = w/2     T3 = 2w      D3 = 2w
C4 = w/2     T4 = 2w      D4 = 2w
Table 3.4: Case three: bandwidth analysis
data needed (C) decreases by a factor of two. If all of the C values are totaled,
the result is 2w, which is the same as the period.
Case Four
Figure 3.5: Case four: nine screen low resolution viewport
Finally, as shown in figure 3.5, the input image is again four times as
large, but now it overlaps nine composers. From the right, it can be seen that
of these nine composers, only four screens' worth of data need to be placed on
the bus at any one time. And, from the analysis of case three in table 3.4,
because the ratio of pixels is 1:4, each requires only one-fourth the bandwidth.
Again, the bandwidth requirements remain constant. Since four composers
must simultaneously have data, the bus must be buffered. Successive cases of
larger viewports and more composers can be extrapolated in a similar manner.
Data               Period        Deadline
C1a = l/2          T1a = 2w      D1a = 2w
C1b = (w − l)/2    T1b = 2w      D1b = 2w
C2 = w/2           T2 = 2w       D2 = 2w
C3a = l/2          T3a = 2w      D3a = 2w
C3b = (w − l)/2    T3b = 2w      D3b = 2w
C4 = w/2           T4 = 2w       D4 = 2w
Table 3.5: Case four: bandwidth analysis
As shown in table 3.5, because pixels are being replicated to twice their
size, the period (T) of the scheduling increases by a factor of two because there
are half as many rows to process. Likewise, the data needed (C) decreases by
a factor of two. If all of the C values are totaled, the result is 2w, which is the
same as the period.
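The constant-utilization claim in these cases can also be checked mechanically.
The following C++ sketch is purely illustrative (none of these names come from
the Metabuffer design): it sums each task's utilization C/T and verifies that
the set of periodic demands never exceeds one screen's worth of bus bandwidth.

    // Illustrative check: a set of periodic bus tasks (C, T, D) fits on the
    // bus when the total utilization sum(C/T) does not exceed 1.
    #include <cstdio>
    #include <vector>

    struct BusTask {
        double data;     // C: pixels needed per period
        double period;   // T: length of the period in bus cycles
        double deadline; // D: deadline (equal to T for these raster tasks)
    };

    bool FitsOnBus(const std::vector<BusTask>& tasks) {
        double utilization = 0.0;
        for (const BusTask& t : tasks)
            utilization += t.data / t.period;
        return utilization <= 1.0;
    }

    int main() {
        const double w = 100.0; // tile width in pixels
        // Case three: four composers, each needing w/2 pixels every 2w cycles.
        std::vector<BusTask> caseThree(4, BusTask{w / 2, 2 * w, 2 * w});
        std::printf("case three fits: %s\n",
                    FitsOnBus(caseThree) ? "yes" : "no");
        return 0;
    }

For case three, each of the four tasks contributes (w/2)/(2w) = 0.25, so the
total utilization is exactly 1.0, matching the observation above that the
summed C values equal the period.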
3.2.2 Buffering of Bus Data Flow
As stated before, supplying a local buffer on each composer is neces-
sary to allow for simultaneous access of the image data. It also provides the
capability to do multiresolution pixel replication. The buffer that each com-
poser maintains closely resembles a queue, except for one important difference.
While the buffer acts in a FIFO manner when Dmultiple is 1 (the source pix-
els and destination pixels are in a 1:1 ratio), if pixel replication needs to be
done, it is necessary to remember data from the previous row. If advanced
smoothing is being performed then multiple rows may be needed. Therefore,
the cache behaves like a queue, but also has a moving window of data that
always stores the previous source row of at least size Sdx.
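A minimal sketch of such a buffer follows; the class and method names are
hypothetical and are not taken from the simulator source.

    // Hypothetical sketch of a composer-side buffer: FIFO behavior for the
    // 1:1 case, plus a moving window retaining the previous source row so
    // that replicated rows (Dmultiple > 1) can reread it.
    #include <cstdint>
    #include <deque>
    #include <vector>

    class ComposerBuffer {
    public:
        explicit ComposerBuffer(int rowWidth) : rowWidth_(rowWidth) {}

        // Pixels arrive from the bus in FIFO order.
        void Push(uint32_t pixel) { fifo_.push_back(pixel); }

        // Consume the next pixel (assumes the FIFO is not empty), remembering
        // it in the previous-row window for later replication.
        uint32_t Pop() {
            uint32_t p = fifo_.front();
            fifo_.pop_front();
            window_.push_back(p);
            if (window_.size() > static_cast<size_t>(rowWidth_))
                window_.erase(window_.begin());
            return p;
        }

        // Reread a pixel of the retained previous row (0 = leftmost).
        uint32_t FromPreviousRow(int x) const { return window_[x]; }

    private:
        int rowWidth_;                 // at least Sdx pixels are retained
        std::deque<uint32_t> fifo_;    // plain queue behavior for the 1:1 case
        std::vector<uint32_t> window_; // moving window over the previous row
    };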
3.3 IRSA Round Robin Bus Scheduling
In order to send data to the composers in a simple yet efficient manner, an
idle recovery slot allocation (IRSA) round robin approach [27] is employed,
which distributes data to the composers evenly based on the amount of data
needed (C), the period (T), and the deadline (D). No effort is made to look
ahead in the geometry of the viewports to find the most efficient way to send
the data out. However, because the previous analysis showed the data demands
to be uniform, this simple method transmits to each buffer with few delays.
In the event that a composer-side buffer becomes too full to cope with
the data, the round robin scheduler performs an idle slot recovery operation.
The composer receiving data drops a bit defined as BUSREADY low on the bus
for one clock cycle. Once the input frame buffer reads the low BUSREADY
bit, it stops sending data to that composer and jumps to the next scheduled
segment in the table. This way other composers can utilize the unused time
on the bus. The scope of the BUSREADY bit will be limited by the fanout of
the bus, but this is true of the bus in general, and the low number of displays
typically used should not cause a problem here.
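The following sketch illustrates the idle slot recovery step just described;
the data structures are illustrative only, standing in for the broadcast
viewport table discussed earlier.

    // Illustrative round robin step with idle slot recovery: walk the segment
    // table in order, skipping composers that have pulled BUSREADY low.
    #include <vector>

    struct Segment {
        int composer;  // Dcomp: destination composer for this segment
        bool busReady; // false if that composer's buffer is currently full
    };

    // Returns the index of the next segment to service after 'last', or -1
    // if every composer has stalled (the bus idles this cycle).
    int NextSegment(const std::vector<Segment>& table, int last) {
        int n = static_cast<int>(table.size());
        for (int step = 1; step <= n; ++step) {
            int i = (last + step) % n;
            if (table[i].busReady)
                return i;  // this composer can accept data
            // BUSREADY low: its slot is recovered and offered to the next entry
        }
        return -1;
    }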
3.4 Sequence of Metabuffer Operations
For each frame, the Metabuffer follows a sequence of steps in order to
compute the final collaged output display. In order to synchronize themselves,
the pipeline composers and output frame buffer employ a PIPEREADY bit to
communicate with each other. The details of this method follow below:
1. Frame Transition: Input frame buffers finish the previous frame, switch
to next frame, and start feeding data to the composers.
2. Waiting for PIPEREADY: At this stage, composers have not re-
ceived a PIPEREADY bit bubbling up from the composers in the pipeline
below, but accept data until their internal buffers are entirely full with-
out transmitting any data for this frame (though the previous frame
could still be in computation) down the pipe.
3. Buffers Are Filled: When the internal buffers of the composers become
full, each drops the BUSREADY bit on each transmission request from
the input frame buffers, effectively stalling the Metabuffer.
4. Output Frame buffers Signal Completion: When the output frame
buffers realize that they have finished building the old frame, they switch
to a new frame and send a high PIPEREADY bit to the previous com-
poser.
5. Composer Relays Finish Signal: When a composer gets a PIPEREADY
bit from the following composer (or output frame buffer), it checks to
see if its internal buffer is fully prefetched solely with the data from the
new frame (all data from the old frame has been cleared out). If so, it
relays the PIPEREADY bit to the previous composer in the pipeline. If
not, it stalls until it is entirely prefetched.
6. Master Composer Signals Start of Frame: Once the PIPEREADY
bit gets to the master composer (the composer at the top of the pipeline),
and the master composer is ready, everything is set for that pipeline
to begin computation of the next frame. The master composer starts
the frame by sending a STARTFRAME bit down the pipeline and then
streaming out data.
7. Composers in Pipe Begin Frame: The other composers in the
pipeline, once they read the STARTFRAME bit, relay that bit down
the pipeline and begin their computation. The STARTFRAME bit is
important because it automatically establishes each composer’s position
on the pipeline (since each successive composer must be offset one cycle
to be synchronized). Only the head composer at the top of the pipeline
needs to be initialized with a PIPEMASTER bit, set via a jumper when
the circuit board is installed.
8. Input Frame buffer Streams Out Data: Now that the pipeline
is started and data is flowing, the input frame buffer will no longer
get BUSREADY low bits, and can resume streaming data out to the
computing composers in a round robin fashion.
Now that data is flowing through the busses and the pipeline, each composer,
using an internal index of the output display, determines whether the segment
it is responsible for intersects the current coordinates. If so, it attempts
to fetch the proper pixel information from the cache and compares it to the
Z value of the previous pixel in the pipeline. Once an entire display has been
sent to the output frame buffer, the process repeats itself.
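A hedged sketch of this per-pixel step is given below; the structures are
illustrative rather than the actual hardware design, and the sketch assumes
smaller Z values are nearer to the viewer.

    // Illustrative per-pixel composer step (not the actual hardware design).
    #include <cstdint>

    struct Pixel { uint32_t rgb; uint32_t z; };

    struct Viewport { int dx, dy, w, h; }; // destination rectangle on display

    inline bool Covers(const Viewport& v, int x, int y) {
        return x >= v.dx && x < v.dx + v.w && y >= v.dy && y < v.dy + v.h;
    }

    // Merge this composer's cached pixel with the pixel arriving from the
    // previous stage of the pipeline; the nearer pixel wins.
    Pixel Composite(const Viewport& v, int x, int y,
                    Pixel fromPipe, Pixel fromCache) {
        if (Covers(v, x, y) && fromCache.z < fromPipe.z)
            return fromCache;
        return fromPipe;
    }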
3.5 Conclusion
The Metabuffer provides for leveraging today’s commodity PC technol-
ogy to construct cost-effective, parallel high-end graphics rendering systems
with multidisplay capability. It has the advantages of easing load balancing
by providing a uniform display space abstraction to the software, supporting
multiresolution and foveated display, and providing a scalable platform with
no changes to stock hardware. It does require the development of non-trivial
custom hardware to perform image compositing. However, a parallel effort
at Stanford University has been able to design hardware that can support a
version of this type of image compositing [20]. Fortunately, most of this work
can be done without resorting to custom VLSI, at least for prototypes.
The Metabuffer can also hope to avoid the fate of so many parallel
architecture projects in the past, in which the development of custom switch-
ing hardware took so long that the advantages of parallel computation were
swamped by the rapid development of commodity semiconductor technology.
It achieves this not only by avoiding custom silicon, but also because the
hardware is designed around video standards, which change more slowly than
processor and system clock speeds. A Metabuffer system will therefore be
usable with many future generations of processors, even with a slower
development cycle.
Chapter 4
Metabuffer Simulator
4.1 Introduction
Because of the complexity of the Metabuffer, a prototype has been built in
software. This prototype models as closely as possible the operation of the
Metabuffer architecture discussed previously in this paper. Since this
software prototype will be the basis for the first hardware implementation of
the Metabuffer, all coding was done strictly with the Metabuffer architecture
in mind.
By building the prototype in software first, it is possible to do much more
extensive testing and to try many more design alternatives in the same amount
of time than would be possible with hardware. Changing a signal or reworking
an algorithm means only recompiling the source code, instead of rewiring a
circuit board or burning another FPGA. Also, with a software prototype, a
Metabuffer consisting of hundreds or thousands of rendering engines can be
simulated. Building a prototype Metabuffer of that size in hardware would
require an enormous amount of resources.
Although the software prototype cannot operate in real time, it can be
used to thoroughly simulate the operations of the Metabuffer. Just about any
aspect of the design can be programmed and evaluated. New algorithms can
be tested on the prototype just as if they were encoded into a DSP. Likewise,
applications that use the Metabuffer can be tested at an early stage with the
software prototype to resolve design issues, bearing in mind that the final
hardware version of the Metabuffer will offer more performance while
operating identically.
4.2 Implementation
The Metabuffer software prototype was completed in C++ since the
highly modular design concept lends itself to the use of object oriented pro-
gramming. Each module (input frame buffer, composer, and output frame
buffer) is defined as a separate C++ class. The data hiding capabilities of
object oriented programming mean that it is possible to create a large Meta-
buffer with possibly thousands of composers simply by replicating one class
over and over again. Also, once the class is defined, changing the layout of the
Metabuffer simply means adjusting the number of frame buffers and composers
being used via the creation or deletion of class instances.
Each class used in the Metabuffer simulator runs in its own pthread. All
the classes are synchronized by a global clock. In hardware, this clock would
be a signal on the bus. In software, the high to low and low to high clock
transitions are implemented by a barrier written using pthread primitives.
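A minimal sketch of such a barrier, written in the spirit of the CClock class
but not taken from the dissertation's actual code, might look as follows:

    // Sketch of a clock barrier built from pthread primitives: the last
    // thread to arrive releases the whole group, emulating a clock edge.
    #include <pthread.h>

    class Barrier {
    public:
        explicit Barrier(int count)
            : count_(count), waiting_(0), generation_(0) {
            pthread_mutex_init(&mutex_, 0);
            pthread_cond_init(&cond_, 0);
        }
        // Each simulated component calls Wait() at every clock transition.
        void Wait() {
            pthread_mutex_lock(&mutex_);
            int gen = generation_;
            if (++waiting_ == count_) {
                waiting_ = 0;
                ++generation_;                     // open the next clock phase
                pthread_cond_broadcast(&cond_);    // release all waiters
            } else {
                while (gen == generation_)         // guard against spurious wakeups
                    pthread_cond_wait(&cond_, &mutex_);
            }
            pthread_mutex_unlock(&mutex_);
        }
    private:
        pthread_mutex_t mutex_;
        pthread_cond_t cond_;
        int count_, waiting_, generation_;
    };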
Figure 4.1: Simulator class instance organization
These barrier calls are placed in a separate class called CClock, which is
referenced by all the other components in the system. A diagram showing both
the layout and the dependencies of the class instances for a Metabuffer
simulator consisting of two renderers and two displays is shown in figure 4.1.
Each of the classes shown is fully documented in Appendix A.
4.3 Multiresolution Output
In order to test multiresolution support of the software prototype of
the Metabuffer, it was necessary to obtain a source of rendered images and Z
order values. Eventually this data will come from the digital output of COTS
rendering engines. For these particular tests, images and Z order values were
generated using the Rayshade ray tracer. Reading an image in TIF format
and the Rayshade generated Z order information into the input frame buffer
class simulates the transmission of a frame of RGB data and a frame of Z order
data from the rendering engine.
Figure 4.2: Rayshade generated input images with viewport configuration
Figure 4.2 shows the TIF images that were rendered using Rayshade: a ball, a
tube, and finally a seascape. The final diagram illustrates how these images
were distributed to the four output displays by being broken up into
viewports. Note that every image is sent to at least two output displays. As
discussed earlier in this paper, the location and geometry of the viewports
are arbitrary. The bandwidth requirements over the bus remain constant.
Running the three images through a Metabuffer configured with three input
frame buffers and four output frame buffers yields the four output screens in
figure 4.3. Note that
Figure 4.3: Composited simulator output images
the tube resides in four separate displays, despite being rendered on a single
machine. Also, see how the seascape here is being used as a low resolution
background display with the higher resolution foreground images layered on
top. Finally, the Z order of the input images is always taken into account,
whether that means that the ball is in front of the tube, or that the ocean
surface laps at the base of the foreground objects.
4.4 Antialiasing Output
One problem with compositing separate images like the ones above is
the aliasing that results on the edges. A solution that has been implemented
involves supersampling. Simply increasing the detail of the input images and
then having the output frame buffers average the pixel values down to the
original size effectively smooths the image. Only the problem pixels at the
edges are affected. The rest of the composited image pixels remain as sharp
as on the original.
This technique is commonly used in graphics cards to antialias displays.
It is extremely simple, since the only major change to the graphics pipeline,
besides the increase in resolution, is an averaging step at the very end. The
main disadvantage is the fact that the graphics hardware has to run so much
faster in order to generate the extra pixels. This is not much of an issue inside
the tightly coupled hardware of a graphics card. In a more loosely coupled
system like a cluster these heightened bandwidth requirements could be a
problem. But, even with the bandwidth concerns, supersampling has been
implemented in PixelFlow, another sort last system similar to the Metabuffer.
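As an illustration of the averaging step, the sketch below assumes 4x
supersampling, pixels packed as 0x00RRGGBB, and hypothetical names; it reduces
each 2x2 block of supersampled pixels to one output pixel.

    // Illustrative averaging step for 4x supersampling: each output pixel is
    // the mean of a 2x2 block. Any alpha bits are dropped.
    #include <cstdint>
    #include <vector>

    // in: (2w) x (2h) image, row-major; out: w x h image.
    void Downsample2x2(const std::vector<uint32_t>& in, int w, int h,
                       std::vector<uint32_t>& out) {
        out.resize(static_cast<size_t>(w) * h);
        for (int y = 0; y < h; ++y) {
            for (int x = 0; x < w; ++x) {
                uint32_t r = 0, g = 0, b = 0;
                for (int dy = 0; dy < 2; ++dy) {
                    for (int dx = 0; dx < 2; ++dx) {
                        uint32_t p =
                            in[static_cast<size_t>(2 * y + dy) * (2 * w) +
                               (2 * x + dx)];
                        r += (p >> 16) & 0xFF;
                        g += (p >> 8) & 0xFF;
                        b += p & 0xFF;
                    }
                }
                out[static_cast<size_t>(y) * w + x] =
                    ((r / 4) << 16) | ((g / 4) << 8) | (b / 4);
            }
        }
    }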
Figure 4.4: Zoomed image without (left) and with (right) antialiasing
The two images generated by the Metabuffer in figure 4.4 (magnified
eight times to show the difference in detail) demonstrate the effect supersam-
pling has on the resulting image quality. On the left, no supersampling has
been performed. There is a jagged transition between the different input im-
ages at the Z buffer transition. On the right, the input images were rendered
to be four times as detailed and the final output pixels were averaged by the
output frame buffer from the four nearest pixels that traveled through the
composer pipeline. The jagged transition is now much smoother while the rest
of the image has lost no quality.
4.5 Transparency Output
A major issue for sort last parallel rendering systems is transparency.
In sort first systems, a region in the display space is assigned to a single com-
puter. That machine can easily make the calculations necessary to create
transparency in that single area. With sort last systems, though, many machines
may be contributing polygons to form a single region in the display space.
Some of those polygons could be opaque and some could be transparent. Poly-
gons could be of varying depth on different machines resulting in interleaving.
Also, polygons are seldom sorted back to front in the compositing chain. This
dissertation discusses three different methods used to create transparency on
sort last systems: interpolated transparency, multipass, and screen door. It
includes the reasoning for using the screen door implementation on the Meta-
buffer and gives examples of its output.
4.5.1 Interpolated Transparency
Interpolated transparency is represented by equation 4.1, as stated
by Foley [15].
Iλ = (1 − kt1) Iλ1 + kt1 Iλ2    (4.1)
The transmission coefficient kt1 measures the transparency of the poly-
gon in the foreground. The final pixel color is achieved by using this coefficient
to linearly interpolate the color contribution of the polygon in the background,
Iλ2, with the color of the transparent polygon in the foreground, Iλ1.
The primary problem with interpolated transparency as it relates to sort last
systems is that it is not commutative. For the technique to work properly,
polygons must be correctly sorted from back to front. Typically sort last
systems allow interleaving of polygons belonging to multiple machines. This
interleaving information is often lost by the time the viewport of the machine
is rendered. Only the topmost polygons and Z values remain. Therefore, strict
rules regarding the grouping of polygons must be followed for it to work on a
sort last system. These restrictions destroy much of the flexibility sort
last systems offer for load balancing.
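The order dependence is easy to demonstrate numerically. The short sketch
below applies one color channel of equation 4.1 with kt1 = 0.3 and shows that
blending white over black differs from blending black over white:

    // One color channel of equation 4.1: foreground i1 with transmission
    // coefficient kt1 over background i2.
    #include <cstdio>

    double Blend(double i1, double kt1, double i2) {
        return (1.0 - kt1) * i1 + kt1 * i2;
    }

    int main() {
        const double kt = 0.3;
        // Swapping the blend order changes the result, so the operation
        // is not commutative.
        std::printf("white over black: %.2f\n", Blend(1.0, kt, 0.0)); // 0.70
        std::printf("black over white: %.2f\n", Blend(0.0, kt, 1.0)); // 0.30
        return 0;
    }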
With the Metabuffer system, another concern is pipeline ordering. Poly-
gons need to be sorted from back to front. That means that distant polygons
must be at the head of the pipeline and the closest polygons should be at the
tail. If the user were to rotate the data set 180 degrees, almost the entire data
set would need to be reshuffled to comply with the sorting assertion.
The Sepia system, however, is excellently equipped to deal with these
issues. As mentioned previously, Sepia uses ServerNet II to form its composit-
ing pipeline. ServerNet II has the advantage that it can be reconfigured on the
fly to change the routes that packets take within the system. A compositing
pipeline can be reordered upside down simply by changing the ServerNet II
routes.
This is the method employed to render volumes on the Sepia system
[31]. A cubed data set is subdivided into 8 pieces. These pieces are rendered
separately and then blended together using the Sepia system. Depending on
the user’s viewpoint, the ServerNet II network adapts to put the pieces in the
correct back to front ordering. Because the pieces do not overlap and have no
interleaved polygons, changing the compositing routes is sufficient to satisfy
back to front sorting. A similar method is employed by Muraki on an image
compositing system using a prioritized binary tree method [40].
4.5.2 Multipass Methods
Mammen [34] describes a method to render transparency in multiple passes. His
technique removes the need to sort the polygons from back to front, but does
introduce more complexity by requiring multiple steps. After all of the opaque
polygons are rendered to a Z buffer, the algorithm goes through an iterative
process to determine which of the transparent polygons is furthest back but
still visible. The transparent effects of that polygon are contributed to the
rendering, and the process is repeated until all of the transparent polygons
have been taken into account.
Multipass transparency is slow and complex, but yields excellent results.
PixelFlow uses this technique, but employs a special library to isolate the
programmer from the difficulties of implementing the operation.
4.5.3 Screen Door
Just as the name implies, with the screen door method of transparency,
instead of treating polygons as transparent they are simply rendered with a
portion of their pixels dropped to allow the background to show through. The
more pixels dropped, the more transparent the polygon appears. Because the
screen door effect is fully recorded by the Z buffer, this technique is
neither dependent on compositing pipeline ordering nor on polygon sorting
order, making it ideal for sort last architectures.
4.5.4 Metabuffer Implementation
Screen door was chosen for the Metabuffer primarily because of the
flexibility it gives regarding the ordering of the compositing pipeline. Unlike
Sepia, with its configurable ServerNet II network, the Metabuffer’s pipeline
is fixed in hardware. But since the screen door algorithm requires no poly-
gon sorting, changing user viewpoints will not require shuffling the data set
and thus will not adversely affect the frame rate. Another advantage is that
the Metabuffer system already uses pixel replication for multiresolution and
employs supersampling for antialiasing. This abundance of redundant pixels
makes it quite easy to create screen door masks without affecting the quality
of the image. For instance, on non-supersampled viewports, each pixel is repli-
cated four times and then averaged down to one pixel on the final display. By
employing a simple checkerboard mask on the replicated pixels, the averaged
output pixel correctly achieves a 50% transmission coefficient. An example
using this method is shown in figure 4.5.
Figure 4.5: Screen door transparency Metabuffer output
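A sketch of the mask decision follows; the subpixel layout is an assumption
for illustration. A checkerboard over the 2x2 replicas drops exactly half of
them, so the averaging step yields the 50% transmission coefficient described
above.

    // Illustrative mask decision for 50% screen door transparency on 2x2
    // replicated pixels: dropped replicas let the background's Z values win,
    // and the later averaging step produces the blended appearance.
    inline bool DropForScreenDoor(int x, int y) {
        return ((x ^ y) & 1) != 0; // checkerboard over the replica grid
    }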
The screen door technique is not without its problems. Because the
Metabuffer only employs 4x supersampling, transparency can only be quan-
tized into four levels. Also, if multiple transparent layers of polygons overlap,
the screen door patterns may interfere with each other creating undesirable
effects. Figure 4.6 is a zoom of figure 4.5 showing how the ball completely
obscures the tube behind it as a result of these mask collisions. In addition,
performing the screen door mask on replicated pixels will produce problems if
polygons from different machines interleave, since only the front-most Z val-
ues for each machine’s viewport are recorded. However, if these limitations are
taken into account, screen door is an adequate way to achieve transparency.
Figure 4.6: Zoom of transparency example
4.6 Distribution
The Metabuffer simulator included in the distribution has been tested
and run primarily on Windows NT. However, it has been ported to IRIX and
should run on any system that has a pthreads compliant library installed.
The Metabuffer simulator distribution consists of three main parts.
The first is the actual source code for the component classes. Included here
is code for supporting classes that form wrappers around the synchronization
primitives. This helps to make the code more cross platform if another thread
library is used instead of pthreads.
Also included in the distribution is a Windows implementation of the
pthreads library [26]. Windows has its own threading model and does not
implement the pthreads standard. Normally, this would be fine and changing
the synchronization classes to Windows functions would port the code. However,
the clock emulation relies on a barrier class built from condition variables,
which Windows does not support natively. The pthreads library included here
for Windows implements condition variables. This is not a trivial task and
actually requires timeout parameters to prevent deadlock. By using the
pthreads library instead of the native Windows threading model, barriers can
be correctly implemented.
Finally, a version of the libtiff library [30] is included in the distribution
for reading and writing images. Source images generated by Rayshade are read
into the Metabuffer simulator as TIFs. Likewise, output images generated by
the Metabuffer simulator COutFrame classes are written as TIF files.
4.7 Conclusion
The Metabuffer simulator provides a valuable testbed for evaluating image
compositing ideas at the granularity of the bus clock. Running test images
through the simulator in numerous different viewport combinations shows that
the Metabuffer can generate glitch-free output images, and thus that
bandwidth requirements are constant no matter what the viewport arrangement
is for the scene.
Chapter 5
Metabuffer Emulator
5.1 Introduction
In order to provide an interactive testbed for writing applications for
the Metabuffer system, an emulator was written in software that would mimic
the operations of the Metabuffer while attempting to run as fast as possible.
The Metabuffer emulator essentially produces the same output as the hard-
ware level simulator, except it is not constrained to work as the Metabuffer
hardware would. Thus it can be optimized to run as fast as possible on the
host architecture.
The host system for the Metabuffer emulator is a Beowulf cluster con-
sisting of 128 networked Compaq computers running the Linux OS. Each ma-
chine contains an 800 MHz Intel Pentium III 256K L2 processor and 256 MB
RDRAM. 32 of these machines are equipped with high performance Hercules
3D Prophet II GTS 64 megabyte DVI graphics cards. Furthermore, 10 of these
graphics cards are linked to a 5 by 2 tiled projection screen display in the UT
visualization lab.
The Metabuffer emulator uses MPI for communication on the cluster
and a slightly modified version of the GLUT [28] library for doing all graphics
rendering and display. Instead of sending image data out of the DVI port to
the Metabuffer hardware, the Metabuffer emulator reads back the pixel infor-
mation from graphics cards belonging to the Beowulf machines using OpenGL
glReadPixels() calls to the GLUT window. This image data is then sent over
the network (instead of through the Metabuffer I/O lines) via MPI and com-
posited by other machines (instead of using the Metabuffer pipeline) in the
Beowulf cluster. These compositing machines also display the final images on
the projection screen display again using the GLUT library.
5.2 Implementation
5.2.1 Granularity
The primary reason that the Metabuffer emulator is faster than the
Metabuffer simulator is granularity. The Metabuffer emulator uses the MPICH
library for communicating data between the machines in the Beowulf cluster.
In the case of the simulator, each component is synchronized with the other
via a global bus clock. No matter how the workload is divided, the machines
doing the processing still have to synchronize themselves to this clock. As a
result of this fine level of granularity, millions of synchronizations are
needed, one for each pixel. For example, a version of the Metabuffer simulator
ported to use MPI required five minutes to complete a single frame.
The Metabuffer emulator, on the other hand, performs all of the work
at the granularity level of the frame. It disposes of the CComposerPipe code
used to process the pipeline pixel by pixel and instead sends whole buffers
of image data directly from the CInFrameBus renderers to the COutFrame
machines which now are responsible for both compositing and displaying the
output.
Figure 5.1: Emulator class instance organization
Figure 5.1 shows the class instance dependencies for a Metabuffer em-
ulator consisting of two renderers and two displays. Each renderer sends one
message to every display in the system. This message contains whatever image
fragment it is contributing to that display. The displays receive all the image
fragments and piece them back together again to form the final image.
Looking at the cross hatching of messages from CInFrameBus renderers
to COutFrame displays immediately reveals that the Metabuffer architecture
does not map well to a common PC cluster. Each renderer must communicate
with all the displays in the system, which can result in very high communica-
tion requirements. Even more problematic, if several rendering machines send
all their data to one display machine, that display machine will be severely
overloaded with compositing duties. As a result, the Metabuffer emulator
running on the Beowulf cluster is not scalable to a large number of machines.
The Metabuffer hardware solves all these issues by using high band-
width parallel I/O and compositing pipelines consisting of many compositing
processors. COTS PCs connected in a cluster exhibit none of these quali-
ties. The bandwidth of the communications network is limited. Also, though
each machine does have a very powerful processor on board, it cannot match
the efficiency of multiple smaller processing blocks in a pipeline arrangement.
Still, even though limited in the number of machines that can be used, the
Metabuffer emulator does achieve interactive frame rates for exploring new
applications for the Metabuffer hardware.
5.2.2 MPI Mapping
One of the biggest issues with writing the Metabuffer emulator was
mapping certain machines to specific MPI processes. Each COutFrame com-
ponent needed to be running on a specific cluster machine that was connected
to a specific display. Otherwise the tiling could not work.
Computation clusters consisting of PC workstations are seldom equipped
with graphics cards and even more rarely are they connected to graphics dis-
plays. Usually each machine is the same as any other. MPI assumes this and
does not offer any way to bind certain machines to certain processes.
In order to overcome this, during initialization the Metabuffer emula-
tor performs an all-to-all broadcast of each MPI process’ machine name. With
this information, each process in the Metabuffer emulator dynamically deter-
mines its role. Processes that are connected to displays automatically use the
COutFrame code and assume the correct position in the tiling. Processes that
are not connected to displays use the CInFrameBus code and also calculate on
which machines the displays are located in order to send their image fragments.
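The following sketch shows one way this role discovery could look. The MPI
calls are standard, but IsDisplayHost() is a placeholder for the emulator's
actual check against the tiled-display host list.

    // Illustrative role discovery: gather every process' machine name, then
    // decide locally whether this process drives a display.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char name[MPI_MAX_PROCESSOR_NAME] = {0};
        int len = 0;
        MPI_Get_processor_name(name, &len);

        // All-to-all broadcast of machine names.
        char* all =
            new char[static_cast<size_t>(size) * MPI_MAX_PROCESSOR_NAME];
        MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                      all, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

        // bool isDisplay = IsDisplayHost(name); // COutFrame role if true
        std::printf("rank %d runs on %s\n", rank, name);

        delete[] all;
        MPI_Finalize();
        return 0;
    }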
5.2.3 Plugin API
Programming the emulator consists of writing just three functions, which are
then linked into the existing code; a hypothetical sketch of their signatures
follows the list below.
InitRenderer(): This function is called at the initialization of the emulator.
It passes to the user code the renderer number (0 to NUMINPUTS-1),
an MPI communicator containing all the renderers in the system (for use
in load balancing operations), and the argc and argv parameters passed
in from the mpirun command line.
GetRendererData(): This function is called at the start of every frame.
The location and resolution of the viewport are requested, along with the
RGB and Z data contained in the renderer's viewport.
UpdateRenderer(): Since the MPICH implementation does not currently
support multithreaded processes, this function is called multiple times
during the image compositing to allow user code to process any message
queues or do other housekeeping tasks. If not needed it can be set to an
empty function.
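A hypothetical plugin.h sketch is shown below. The dissertation names the
three entry points, but the parameter types here are reconstructed from their
descriptions and may differ from the real distribution.

    /* Hypothetical plugin.h sketch; types are reconstructed, not verbatim. */
    #ifndef PLUGIN_H
    #define PLUGIN_H

    #include <mpi.h>

    // Called once at startup: this renderer's index (0 to NUMINPUTS-1), a
    // communicator over all renderers (for load balancing), and the mpirun
    // command line arguments.
    void InitRenderer(int renderer, MPI_Comm renderers, int argc, char** argv);

    // Called at the start of every frame: report the viewport location and
    // resolution and hand back the viewport's RGB and Z data.
    void GetRendererData(int* x, int* y, int* width, int* height,
                         unsigned char** rgb, float** z);

    // Called repeatedly during compositing so single-threaded user code can
    // drain message queues or do other housekeeping; may be empty.
    void UpdateRenderer();

    #endif // PLUGIN_H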
The two multiresolution techniques discussed in this dissertation were
both coded as plugins for the Metabuffer emulator. The advantage of splitting
the application code from the emulation code with this strict API is that these
Metabuffer emulator plugins can then very easily be made into applications
that interact with the actual Metabuffer hardware. The only requirement
would be to replace the Metabuffer emulator code with the code required to
interact with the hardware. The very same plugin API could still be used.
5.3 Distribution
Although the code in this distribution has been tested only on Linux
clusters, it should be portable to just about any OS. The emulator relies heav-
ily on the MPI, GLUT, TIFF, and OCview libraries, all of which have been
compiled for many different operating systems. The actual Metabuffer emula-
tor code should be very cross platform.
Again, for reference, the cluster here at UT consists of 32 Linux ma-
chines equipped with Hercules 3D Prophet II GTS graphics cards. 10 of these
machines are connected to a 5 x 2 tiled projection display in the visualization
laboratory. The other 22 graphics cards are used only for rendering.
The Metabuffer emulator uses MPI to communicate between the ma-
chines and OpenGL to work with the graphics cards. A typical session has
10 of the machines rendering polygons and 10 others doing the Z depth com-
positing and ultimately displaying the graphics on the projectors.
The software should run on any Linux based cluster that has some
version of MPI (practically standard on most computing clusters) and runs
XWindows with support for OpenGL. Appendix B includes more details on
creating the emulator executable.
5.3.1 Plugins
Four plugins are in this distribution. They are located in the meta/plugin
directory. To change plugins, simply copy them to the meta/emu directory
and change their name to plugin.cpp. The distribution initially has teapot.cpp
as the plugin.cpp file.
1. teapot.cpp The famous Utah teapot bounces around the tiled display.
This plugin is the simplest because it does not use the OCview rendering
library and doesn’t need the metadata part of the distribution.
2. ducksetal.cpp Similar to the teapot, but instead of teapots, small
OCview objects move around the screen.
3. progressive.cpp This plugin is the progressive image composition ex-
ample. A 9.2 million triangle isosurface extraction of the visible human
data set is split into 10 pieces of 920,000 triangles. Each piece is ren-
dered by a different machine and then composited together to form the
entire image. The pieces are first cycled in a circle to show they are
individual and can move anywhere in the display, then they are put to-
gether, zoomed, and rotated. The resolutions of the parts change if their
triangles cannot fit within high-resolution viewports. This way no poly-
gon or pixel information needs to be communicated between machines
and frame rates remain constant. The plugin does not contain code to
rebalance the triangles in order to regain high-resolution viewports for
different views. Editing the plugin.cpp source code and changing the
DATASET #define allows either the VIZHUMAN, SANTABARBARA,
or OCEAN data sets to be viewed.
4. fovea.cpp For the foveated vision plugin, the renderers are assigned ar-
eas of the screen according to where the user is currently gazing. The
majority of renderers draw the region where the user is focused. At the
same time, the minority of renderers concentrate on drawing the periph-
ery. This smaller number of processors can render the larger area because
they are working in low resolution. Since human peripheral vision lacks
detail there is no reason to render this area with as much acuity as where
the user is focused. Using the same argument, these renderers also deal
with decimated data sets to reduce their polygon counts to manageable
sizes. Again, the periphery is not sensitive to this loss of detail. The
result of this is that the user is presented with a high resolution region of
interest and a constant frame rate, no matter what viewpoint is chosen.
By editing the plugin source and changing the DATASET #define the
user can view either the VIZHUMAN, SKELETON, or ENGINE data
sets.
Writing a custom plugin simply means creating the three functions
specified by plugin.h in a plugin.cpp file and linking it in. A plugin does not
have to use the GLUT library or OCview. It can use anything that will provide
a source of RGB and Z information.
5.3.2 Future Work
Unlike the Metabuffer hardware simulator, the Metabuffer emulator does not
currently support supersampling. This means that neither antialiased
supersampled viewports nor screen door transparency are possible.
5.3.3 Undocumented Features
This emulator is constantly evolving and there are several features buried in
the code that might be useful for other developers. In CInFrameBus.cpp the
#define SHOWVIEWPORT turns on or off black rectangles that mark the viewport
locations on the output displays. In COutFrame.cpp the #define SAVEOUTIMAGE
will save the output image that the machine is showing on the tiled display
wall in that machine's /tmp directory. Collecting images from all the output
machines and running them through metapaste.c in the meta/tools directory
will combine them into a single image, which could then be made into an AVI.
Likewise, in the plugins, the #define SAVEFRAME will save the rendered
viewport into the /tmp directory. The plugins also support a stand-alone mode
if make -f Makefile.sa is used. This allows an individual rendering machine
to run sans MPI and display its viewport on the local display. This can be
useful for debugging.
5.4 Conclusion
The emulator presented in this chapter allows for the development of
full featured applications for the Metabuffer architecture. While it cannot
approach the performance of the Metabuffer hardware, the software emulator
gives good enough speed to allow interactive testing of Metabuffer applications.
Once written for the Metabuffer emulator, applications can easily be
ported to work with the Metabuffer hardware. At most, a simple library
should be all that is needed to abstract the interaction with the video card
frame buffer to that of the Metabuffer emulator plugin API.
Chapter 6
Greedy Viewport Allocation Algorithm
6.1 Introduction
Given a triangular mesh, it is very important to distribute the triangles
properly in order to achieve a good load balance among parallel rendering
servers. A parallel system is only as fast as its slowest member, so ensuring
that the work is evenly distributed is paramount to obtaining good timings
and therefore good speedups and processor utilization efficiency.
This chapter explores the problem of load balanced triangular mesh
partitioning for the rendering servers of the Metabuffer. The goal is to
distribute the triangles in a mesh in such a way that every triangle is
rendered by at least one server, the rendering loads are evenly balanced, and
each grouping of triangles is located within a screen sized area in the
overall display space.
The last issue is very important in order to create a fully high resolution
display, since the renderer graphics cards are limited to a screen's worth of
output image data. Only if each group of triangles can fit completely within
the frame buffer of each graphics card can a completely high resolution display
be composited together. The multiresolution capability of the Metabuffer will
be exploited later in chapter 8 to provide time-critical progressive rendering
with constant frame rates while the user is aggressively panning and zooming
the scene.
6.2 Background
6.2.1 Sort First Algorithms
Samanta [44] discusses several partition algorithms for the SHRIMP
sort-first system. As a sort first system, these algorithms attempt to find
nonoverlapping regions of screen space that can be distributed among proces-
sors so that each machine will have an equal rendering load. The algorithms
used by SHRIMP are grid bucket, grid union, and kd-split.
Grid Bucket
In the SHRIMP implementation of the grid bucket algorithm, the entire
screen space is divided up into squares. Groups of squares are then assigned
to renderers in an evenly balanced way in order to load balance the rendering
work. A heuristic is used to estimate the costs associated with having a par-
ticular square rendered by a particular machine. In the case of SHRIMP, these
costs can be significant, since pixels must be transferred for every square that
is not rendered on the machine driving that square's display. Using the polygon
distribution and these statistics, the squares are divided evenly.
Grid Union
The grid union algorithm tries to improve on one of the main defi-
ciencies of the grid bucket algorithm as relating to the SHRIMP sort first
architecture. Dividing the screen space up into small squares and then assign-
ing those squares to different rendering machines means that many polygons
located on the edges of the squares will have to be rendered twice. To prevent
this, the grid union algorithm attempts to merge adjoining squares on the
same renderer. Thus, there will be fewer polygon overlap penalties.
KD-Split
The kd-split algorithm avoids the overhead of partitioning the screen
space into many very small squares and instead recursively partitions it,
first in one dimension and then in the other. For example, for a given screenful of
polygons, the algorithm determines where in the display a vertical line would
divide the image evenly in terms of polygon rendering time. The amount of
rendering work on the left would be equal to the amount of rendering work on
the right. Next, two horizontal lines evenly divide each of the evenly divided
halves. This is done successively until the screen space is partitioned into the
correct number of tiles needed for the number of renderers.
The kd-split minimizes the amount of polygon overlap due to the fewer
number of partitions. However, keeping the rendering workload local to the
display machine is problematic for the SHRIMP system. The kd-split algo-
rithm also has the effect of generating partitions of varying sizes. In some
cases the partitions could be bigger than the rendering capabilities of the
graphics cards used on the machines, necessitating further subdivision. Still,
the kd-split algorithm usually performed the best in the testing presented in
Samanta’s paper.
6.2.2 Sort Last Techniques
As a sort-last image composition system, the Metabuffer has a few more
freedoms than SHRIMP. First, it allows overlapping images
rendered by different processors. This provides more flexibility in assigning
rendering processors to image space. It eliminates the polygon overlap over-
head that SHRIMP encounters when it needs to render the same polygon twice
on adjoining regions belonging to different machines. Second, the fact that
Metabuffer viewports can be located anywhere on the overall display space
means that the pixel redistribution overhead seen in SHRIMP is also gone.
However, as a result of its architecture the Metabuffer has a constraint
that does not severely affect the SHRIMP system. As stated previously, in order
to obtain a high resolution display, and to obtain results comparable to that
of SHRIMP, every viewport must be the size of a single display tile. The use
of multiresolution can temporarily avoid this constraint, but it is a necessary
requirement to get a high resolution output. The SHRIMP system also must
abide by this constraint, but it can subdivide large regions and render them
separately if needed, while the Metabuffer does not have this option.
The additional freedoms and the additional constraint imposed by the
Metabuffer means that the polygon assignment algorithms for the SHRIMP
system are not applicable for the Metabuffer architecture. Instead of dividing
the screen space into regions of varying size as is the case with SHRIMP, what
is really needed is an algorithm that fully covers the polygons with squares
(viewports) of constant size (resolution).
Shifting Strategy
The conditions and constraints for this problem are analogous to the
covering with squares problem. The covering with squares problem can be
stated as follows: Given n points on a grid, find the smallest set of squares s
of a certain size covering all those points.
The Metabuffer viewport algorithm differs slightly from the covering
with squares algorithm. Instead of a minimal number of s squares, the Meta-
buffer case requires a constant number of v viewports, where v is the number
of viewports (renderers) available, and typically v >> s. Also, each Metabuf-
fer viewport must cover an equal number of polygons, where in the covering
with squares algorithm it is only required that the union of the squares cover
all the points. Still, given the solution to the covering with squares prob-
lem, it should be straightforward to determine the solution to the Metabuffer
viewport problem.
Because the covering with squares problem is strongly NP-Complete,
research has concentrated on finding algorithms that give approximate
solutions. Hochbaum [22] presents a bounded error approximation algorithm to
solve this problem using a shifting strategy. Likewise Gandhi [18] shows a
shifting strategy solution for a partial covering variant of this problem.
The basic concept of the shifting strategy is divide and conquer. Instead
of dealing with the entire screen space and finding the optimal covering using
brute force, the screen is divided into smaller parts with a smaller search space.
Even though the solution determined from these smaller search spaces may not
be optimal, it can be proven that it is optimal within a bounded error amount.
In the algorithm, each dimension is treated individually. The display space I
is divided into strips of width D. The size of the search space is determined
by l, the number of contiguous strips that are used in the search. If l
contiguous strips are used, there are l different ways to group the strips
into lD widths (in essence, shifting over by D each time). All l of these
groupings are searched in a brute force manner to determine the best covering.
The smallest result of the l outcomes is used for the final answer. This is
repeated for each dimension.
Hochbaum shows that such an algorithm runs in O(l^d n^(2ld+1)) time, where l
is the number of contiguous strips considered, n is the number of points, and
d is the dimension.
6.3 Implementation
Even though the shifting strategy gives a bounded error approximation
to the covering with squares problem, it is too slow for dealing with the large
numbers of polygons in the Metabuffer viewport problem. From the order
analysis, with two dimensions and the smallest and most error prone l of one,
the algorithm is still O(n5). Given a large data set where n could be millions
of polygons, the time required to compute the covering would be quite large.
Because of this, a much simplier greedy algorithm is presented in this
dissertation. While it cannot guarantee a bounded error answer, it does run
in O(nlog(n)) average time. In fact, for large numbers of polygons, the most
computation time is required by the quicksort, which although is O(n2) in the
worst case is typically closer to O(nlog(n)) for an average run. The greedy
algorithm for load balancing a mesh into a set number of viewports for the
Metabuffer is described formally as follows.
Conditions
1. There exists a screen of n tiles and m rendering servers (m ≥ n). Each
tile has the same size of w × h pixels and each server has the same
rendering capability of c triangles per second.
2. There are p triangles that project into the screen. We assume each
triangle takes the same amount of time to render.
Constraints
1. To compare fairly with the results of [44], the viewport size of each
server should be the same as the size of a tile, w × h. However these
viewports can overlap each other, which differs from the sort-first approach
of [44].
2. Every triangle should be covered by the union of viewports, and rendered
by at least one server. If parts of a triangle are rendered by different
servers, the triangle is counted multiple times.
3. A triangle can only be rendered by servers whose viewports cover the
triangle. In other words, there is no communication of pixels between
different servers.
The goal is to find the best placement of the viewports and assignment of
triangles to viewports such that the triangles are rendered in the shortest
time.
Lemma 1. If the total number of triangles is p and each rendering server can
render c triangles per second, the best possible time is p/(m × c). The worst
case time is p/((m − n + 1) × c). It occurs when almost all triangles project
to a single tile but a few triangles are scattered across other tiles such
that n viewports are necessary to cover all triangles.

Any other case can be rendered in p/((m − n + 1) × c) time, because there are
at most (m − n) viewports that have more than p/(m − n + 1) triangles after
we cover the whole display with n viewports. Those extra triangles can be
assigned to the remaining m − n viewports such that each rendering server has
no more than p/(m − n + 1) triangles to render.
The steps of the proposed greedy viewport allocation algorithm are as follows
(a condensed code sketch appears after the list):
1. The center of mass of the triangles is found by taking a weighted average
of the two dimensional coordinates of the projected bounding box for each
triangle.
2. The triangles are sorted by the distance to the center of mass.
3. In order of decreasing distance, each triangle is assigned to a viewport.
If no viewport can cover the triangle, a new viewport is created. If multiple
viewports can cover the triangle, the triangle is assigned to the viewport
with the least mobility (i.e., the viewport whose previous triangle
assignments allow it to be moved the least). If a viewport has a triangle
count a predefined percentage higher than the optimum average polygon load,
it is closed to additional triangle assignment.
4. A final series of passes is made over the triangle list, during which
viewports with a higher than average number of triangles attempt to reassign
their triangles to viewports with lower than average counts.
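The condensed sketch below illustrates steps 1 through 3. The names and types
are illustrative; viewport mobility is simplified to a bounding-box slack
test, the closing rule is modeled by a simple per-viewport triangle limit,
and the rebalancing passes of step 4 are omitted.

    // Condensed, illustrative sketch of steps 1-3 of the greedy algorithm.
    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct Tri { float x, y; };  // projected triangle center
    struct Vp  { float minx, miny, maxx, maxy; int count; };

    // Can a w x h viewport still slide so that it covers both its current
    // triangles and the point (x, y)?
    bool CanCover(const Vp& v, float x, float y, float w, float h) {
        return std::max(v.maxx, x) - std::min(v.minx, x) <= w &&
               std::max(v.maxy, y) - std::min(v.miny, y) <= h;
    }

    void Assign(std::vector<Tri>& tris, std::vector<Vp>& vps,
                float w, float h, int maxPerVp) {
        // Step 1: center of mass (tris assumed non-empty).
        float cx = 0, cy = 0;
        for (const Tri& t : tris) { cx += t.x; cy += t.y; }
        cx /= tris.size(); cy /= tris.size();
        // Step 2: sort by decreasing distance from the center of mass.
        std::sort(tris.begin(), tris.end(), [&](const Tri& a, const Tri& b) {
            return std::hypot(a.x - cx, a.y - cy) >
                   std::hypot(b.x - cx, b.y - cy);
        });
        // Step 3: assign each triangle to the least mobile open viewport,
        // creating a new viewport when none can cover it.
        for (const Tri& t : tris) {
            Vp* best = 0;
            float bestSlack = 1e30f;
            for (Vp& v : vps) {
                if (v.count >= maxPerVp || !CanCover(v, t.x, t.y, w, h))
                    continue;
                float slack = (w - (v.maxx - v.minx)) +
                              (h - (v.maxy - v.miny));
                if (slack < bestSlack) { bestSlack = slack; best = &v; }
            }
            if (!best) {
                vps.push_back(Vp{t.x, t.y, t.x, t.y, 0});
                best = &vps.back();
            }
            best->minx = std::min(best->minx, t.x);
            best->miny = std::min(best->miny, t.y);
            best->maxx = std::max(best->maxx, t.x);
            best->maxy = std::max(best->maxy, t.y);
            best->count++;
        }
    }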
Figure 6.1: Viewport configuration for horse example.

The strategy this algorithm employs is to assign the far-flung triangles
first, since the area in the center of the image is most likely to be covered
by a high number of viewports while the edge region is less probable to have
many choices in coverage. The algorithm also attempts to maintain the highest
degree of mobility for each viewport for as long as possible. This means that
there will be more viewport choices for triangles later in the assignment chain.
The algorithm requires a sort of all triangles in the scene, which provides a
lower bound on its complexity. Using a large number of renderers (and thus a
large number of viewports) the algorithm is O(pm).
6.4 Results
Figure 6.1 shows how the triangles of a horse model are divided into
eight viewports in an eight renderer Metabuffer configuration. The rectangles
show the computed viewports, which have the same size but can be positioned
arbitrarily in the image space. The colors of the triangles indicate the
viewport to which each has been assigned. The total number of triangles in
the horse model is 22,258. The number of triangles in each viewport varies
from 2,782 to 2,783.
The load balancing algorithm took 0.051 seconds to process the horse
model: 0.008 seconds to compute the center of mass and distances, 0.019
seconds to sort the triangles, and 0.024 seconds to assign the viewports. The
concentric circles show the distance from the computed center of mass and
the rectangles show the computed viewports. It is obvious from the radiating
triangle assignments that the algorithm depends heavily on the distance from
the center of mass. Obviously the algorithm in this case was helped by the
compactness of the horse and the relatively even distribution of triangles,
although the high level of detail in the head tested its abilities.
Figure 6.2 plots the timings of the horse model and several other data
sets of varying size to demonstrate the complexity of the algorithm. For the
horse model, the assignment of triangles to the viewports took the longest
time. However, as seen by the graph, as the triangle count increases for the
larger models, the sort time overshadows all other parts of the algorithm due
to its greater complexity. Using the quicksort, the sort time curve is
O(p log p). This contrasts with the viewport assignment time curve, which is
O(mp) and therefore is linear if the number of triangles grows and the number
of processors
[Plot omitted: greedy viewport assignment algorithm timings in seconds versus
number of polygons (up to 1.2e+07), with curves for "Assign", "Distance",
"Sort", and "Total".]
Figure 6.2: Greedy algorithm timings for various model sizes
is kept constant. Likewise, the distance calculation is simply O(p) and is
linear as well. The total of these three parts of the algorithm primarily
reflects the contribution of the sort time, resulting in an O(p log p)
algorithm when p >> m.
6.5 Conclusion
The greedy algorithm presented in this chapter gives fast viewport as-
signments that consist of evenly balanced triangle counts. Using the method
presented here of assigning far flung triangles first and attempting to maximize
the mobility of existing viewports, viewport assignments are able to cover all
of the triangles evenly, while at the same time limiting the spatial area they
are required to render in order to increase the resolution for that particular
viewport.
This algorithm will be used extensively for the progressive image com-
position plugin presented in this dissertation. It will be used to initially assign
the triangles to the viewports in a load balanced manner while giving a com-
pletely high resolution display for the initial viewpoint.
Chapter 7
Wireless Visualization Control Device
7.1 Introduction
When using very large, multiscreen, tiled displays in conjunction with
the visualization of large data sets, it is important for the user (or users) to
be able to interact easily with the application. In the case of the Metabuffer
[4] project, this is especially true since the aim of using its multiresolution
capabilities is to increase user responsiveness. The Metabuffer currently has
two different multiresolution plugins, each of which requires an easy, portable
user interface.
The first plugin, progressive image composition, uses multiresolution
in order to hold frame rates steady regardless of changing user viewpoints.
It also uses polygon redistribution in order to create high-resolution displays
when the user pauses to analyze key areas of the data set. Programming in
predetermined routes for the data set to be manipulated, while demonstrating
the technique, does not show how the plugin would respond in the real world
to random user input. By tying the plugin to a user interface, the user is free
to stress the plugin by changing views, zooming in or out, or simply navigating
through the data set. With this real world interactivity, it is more apparent
how steady frame rates increase the responsiveness a user experiences, and
thus the value of the progressive image composition technique and of
multiresolution in general.
The second plugin, foveated vision, tracks the gaze of one or more
users and renders those areas in high resolution. Areas in the periphery of
the user’s view are rendered in low resolution. Therefore, it is necessary to
have a user interface that allows the tracking of multiple users’ gazes at once.
Again, preprogrammed gazes, while demonstrating the technique, do not show
the advantages that higher frame rates have for responsiveness in real world
navigation and thus the value of the foveated vision technique and the use of
multiresolution.
To create such a mobile user interface, standard COTS Windows CE
Pocket PC devices were selected. Equipped with wireless Ethernet PCMCIA
cards, they are lightweight, small, relatively inexpensive, and user friendly.
While eventually gaze trackers on headsets will provide foveated vision infor-
mation, in the meantime the wireless Pocket PC devices serve in this role.
Using mobile computing in conjunction with the Metabuffer opens up many
new possibilities for user interactivity. The next section details the current
state of research in using mobile devices for user input. After that, this paper
discusses the design of the Metabuffer mobile system and implementation
results to date.
7.2 Background
Historically, research in using mobile computing for user interfaces can
be divided into three main areas: ubiquitous computing, augmented reality,
and context aware applications. In many cases research projects fulfill the
requirements of more than one area, especially as COTS mobile computing
devices have become more powerful.
7.2.1 Ubiquitous Computing
Ubiquitous computing essentially means bringing the concept of com-
puting out of the computer room and into everyday lives. Instead of doing
work with a computer while sitting in front of a monitor, people go about their
daily activities with computers integrating seamlessly into the environment.
The term was coined by Weiser [51]. Ironically for the Metabuffer
project, Weiser considers that “ubiquitous computing is roughly the opposite
of virtual reality.” To him, virtual reality involves putting the computer at the
center whereas ubiquitous computing should revolve around the real world.
Of course, in the case of the Metabuffer, wireless mobile devices are being
integrated into a virtual reality environment.
While the definition may not match the Metabuffer’s application, con-
cepts of ubiquitous computing research certainly do. At Xerox’s PARC lab,
wireless mobile devices called tabs were employed to keep track of roaming
employees and allow those employees to remotely set temperature, light, and
humidity levels in different rooms. This is analogous to the types of data the
wireless devices would provide as input for Metabuffer visualization applica-
tions.
7.2.2 Augmented Reality
With augmented reality, virtual reality is used, but only to supplement
the information in the real world. Usually this is through head mounted dis-
plays in conjunction with other computers worn on the user. When the user
walks around his or her environment, the computers display additional infor-
mation over the real world scenes that inform the user about state, structure,
or other attributes.
One type of augmented reality that uses handheld devices is called sit-
uated information spaces [14]. By tracking the location of the user, a handheld
can specify information that would be relevant to the user’s needs or task. For
example, if the user was next to a movie theater, the handheld could display
show times and ticket availability.
This idea could be used in Metabuffer visualization applications. By
knowing where the user is looking, the handheld could display additional infor-
mation. While examining a galaxy data set, for example, tracking the user’s
gaze at a certain star or celestial feature could reveal data about that object
on the handheld leaving the actual display uncluttered for other users to view.
7.2.3 Context-Aware Applications
Context-aware applications use information about the user’s location
in order to provide data at the right place at the right time. Essentially this
allows the user to roam freely with applications coming on-line customized to
his or her needs no matter the user’s locale.
Lamming [29] introduces the concept of memory prostheses. By record-
ing information pertinent to a user’s surroundings, this information can be
recollected in a similar circumstance in the future and thus provide the user
with an appropriate set of recalled information.
This concept can be applied to the Metabuffer visualization application
by allowing the wireless input devices to store information about the user’s
navigation patterns as they relate to individual data sets. In this manner,
users can set bookmarks of views in the data set and come back to those views
in the future. They may also be able to take notes on certain areas of the data
set. This information would be stored on the wireless unit independent of any
other user that happens to be viewing the data set and could be recalled at a
future time.
7.3 Implementation
The design of the Metabuffer system is relatively simple from a hard-
ware standpoint. Recent advances in wireless handheld technology have re-
duced what used to be a complicated technical undertaking to just plugging
in a collection of COTS components.
The main piece of this puzzle is the Compaq iPAQ Pocket PC device.
This device runs the Windows CE operating system from Microsoft. The
Windows CE operating system is essentially the standard Win32 API with su-
perfluous parts removed to save space. For example, Windows CE is entirely
Unicode based. Therefore, all ASCII routines have been removed. Although
restrictions like these mean that code has to be written specifically for Win-
dows CE devices, most Windows programmers have little trouble adapting to
the new operating system. A big advantage for Windows CE programmers is
that Microsoft provides the Windows CE development environment and SDKs
free of charge in order to encourage growth of applications for the operating
system.
For wireless connectivity, an Orinoco RG-1000 residential gateway is
employed along with Lucent wireless Ethernet cards. The wireless Ethernet
cards plug into the iPAQs by means of a PCMCIA adapter. They are then con-
figured to talk to the RG-1000 which is connected to the Metabuffer cluster’s
LAN. From this point, communicating over the network is seamless.
The user interface application for the iPAQ is written in standard C
using the Windows CE API. It provides a way to manipulate the orientation
and zoom of the data set being examined, along with a means to provide gaze
information by clicking on a representation of the tiled screen space.
Figure 7.1: Wireless visualization device user interface
Figure 7.1 shows an actual screen shot of the user interface. The cube
in the screen shot can be rotated by using the iPAQ's stylus. This provides the
orientation of the model. At the top of the shot is a representation of the tiled
display wall. The longhorn icon is placed where the user is gazing, again via
the stylus. At the bottom of the shot is a slider bar which controls the zoom.
The orientation and gaze information received from the graphical UI is
transmitted over the wireless Ethernet as UDP packets to a server residing on
the land based host cluster. This server collects the information from all the
wireless devices and stores the current state of all of them locally.
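As a rough illustration of the sending side, consider the Winsock sketch
below. The packet layout, names, and the per-call socket setup are
assumptions for illustration only; the text does not specify the actual
wire format.

#include <winsock2.h>
#include <string.h>

/* Hypothetical state packet; the real field layout is not given in
   the text, so this struct is illustrative only. */
typedef struct {
    float rotX, rotY, rotZ;   /* model orientation from the stylus  */
    float gazeX, gazeY;       /* gaze position on the tiled display */
    float zoom;               /* slider bar value                   */
} StatePacket;

/* Send the current UI state as one UDP packet to the cluster. */
void send_state(const char *serverIp, const StatePacket *p)
{
    WSADATA wsa;
    SOCKET s;
    struct sockaddr_in addr;

    WSAStartup(MAKEWORD(2, 2), &wsa);
    s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(6666);                 /* forwarded port  */
    addr.sin_addr.s_addr = inet_addr(serverIp);  /* Prism's address */

    sendto(s, (const char *)p, sizeof(*p), 0,
           (struct sockaddr *)&addr, sizeof(addr));
    closesocket(s);
    WSACleanup();
}

In a real interface the socket would of course be created once and
reused. UDP is a sensible choice here because only the most recent state
matters, so a lost packet is harmless.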
At each frame, the Metabuffer application queries the server about the
status of the wireless users. This is done via a named pipes mechanism. The
server was separated from the Metabuffer application because the Metabuffer
emulator uses MPI as its basis. Currently Prism’s version of MPICH does
not support multithreading. Therefore running it as a separate process allows
the Metabuffer to run unencumbered. The individual process model will also
make it easier for other applications to have access to the same data.
Figure 7.2 is an overview of how the entire process works for a Meta-
buffer frame. First, the Windows CE iPAQ device collects information from
the user through its graphical interface. This information is then sent as UDP
packets by wireless Ethernet card to the Orinoco gateway antenna. The gate-
way relays the UDP packets to Prism, which is the firewall for the visualization
cluster. Prism forwards this particular port (currently port 6666) to the
Alpha1 machine located behind the firewall.
[Figure: iPAQ, via wireless Ethernet, to the RG-1000 gateway, to Prism, to the listener and ccvpipe on Alpha1, whose MPI process feeds the MPI processes on Alpha11, Alpha12, Alpha13, etc.]
Figure 7.2: Wireless visualization operation
Running on the Alpha1 machine
is a custom UDP server application called “listener”. This server application
collects UDP packets being sent by the iPAQs and saves the most recent in-
formation. At each frame, the MPI process bound to the Alpha1 machine
queries a named pipe located on Alpha1’s file system called “ccvpipe”. When
this happens a separate thread from “listener” writes the current iPAQ data
to the named pipe. The Alpha1 process receives the data and broadcasts it
over MPI to all of the rendering machines.
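To make this flow concrete, below is a minimal sketch of a listener-style
server. It is not the actual listener.c; the buffer size, packet handling,
and threading details are assumptions based only on the description above.

#include <pthread.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define PORT      6666                    /* forwarded from Prism     */
#define PIPE_PATH "/home2/wjb/ccvpipe"    /* created with mknod ... p */

static char            latest[512];       /* most recent iPAQ packet  */
static int             latest_len = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread: each time a reader (the MPI process) opens the named pipe,
   write the current iPAQ data into it. */
static void *pipe_writer(void *arg)
{
    for (;;) {
        int fd = open(PIPE_PATH, O_WRONLY);   /* blocks until a reader */
        if (fd < 0) break;
        pthread_mutex_lock(&lock);
        write(fd, latest, latest_len);
        pthread_mutex_unlock(&lock);
        close(fd);
    }
    return 0;
}

int main(void)
{
    pthread_t tid;
    struct sockaddr_in addr;
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(PORT);
    addr.sin_addr.s_addr = INADDR_ANY;
    bind(sock, (struct sockaddr *)&addr, sizeof(addr));

    pthread_create(&tid, 0, pipe_writer, 0);

    /* Main loop: collect UDP packets, keeping only the newest state. */
    for (;;) {
        char buf[512];
        int n = recvfrom(sock, buf, sizeof(buf), 0, 0, 0);
        if (n <= 0) continue;
        pthread_mutex_lock(&lock);
        memcpy(latest, buf, n);
        latest_len = n;
        pthread_mutex_unlock(&lock);
    }
}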
7.4 Distribution
The source code distribution for the wireless interface is contained in
the file ccv.zip. It includes four subdirectories:
1. Linux: This directory contains the source code for listener.c, the UDP
server, as well as test.c, which simply requests information from the named
pipe. It also has a readme document describing how to set up the UDP server
on Prism.
2. Source: This directory has the source code for the actual user interface
application.
3. Win32: This directory contains the projects that will build the user in-
terface for a Windows desktop or laptop machine using Microsoft Visual
Studio. It isn’t hard to make Windows CE programs cross platform with
their desktop counterparts, and allowing it to run on a desktop machine
facilitates testing.
4. WinCE: This directory contains the Windows CE projects that will build
the user interface application for the iPAQ using Microsoft Embedded
Visual Tools.
Most people are familiar with compiling programs for the Windows
desktop environment using Visual Studio. Configuring a programming envi-
ronment for Windows CE is just as simple. Microsoft currently provides the
Embedded Visual Tools system and Windows CE SDKs for free on their web
site [35].
After downloading and installing the Embedded Visual Tools software,
building a Windows CE application is just like building a Visual Studio
application. The only difference is selecting the processor type of the
Windows CE device (in the iPAQ's case it is a StrongARM) and the platform
(in the case of the iPAQ it is PocketPC). Embedded Tools will compile and
link the code and then send it to the device automatically if it is currently
synced in its cradle.
Once the iPAQ is configured with the user interface code, it is time
to ready the cluster to receive the iPAQ’s UDP transmissions. First, create
the named pipe on Alpha1 (or whatever machine is assigned to receive the
packets):
mknod /home2/wjb/ccvpipe p
Next, ensure UDP packets are passed from Prism to the local machine
(in this case Alpha1 which has the IP address of 192.168.128.97) by logging
on to Prism and giving the following port forwarding command for port 6666:
ipmasqadm autofw -A -r udp 6666 6666 -h 192.168.128.97
Then, run the listener server on the local machine (Alpha1) to receive
the UDP packets and write to the named pipe:
listener
Configure the Metabuffer MPI process on Alpha1 to read the named
pipe to get information on positioning. Simply edit the enviro.h file to set
the WIRELESS #define to 1, the WIRELESSSERVER #define to the local
machine (Alpha1), and the WIRELESSPIPE #define to the full path of the
pipe just created.
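Assuming the pipe created above, the relevant lines of enviro.h would then
look something like this (the values shown are illustrative):

/* enviro.h -- wireless interface settings; values are illustrative */
#define WIRELESS       1                       /* enable the interface  */
#define WIRELESSSERVER "alpha1"                /* machine with the pipe */
#define WIRELESSPIPE   "/home2/wjb/ccvpipe"    /* pipe made with mknod  */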
If the plugin is written correctly, this process can then send the posi-
tioning information using MPI to the other machines so that everyone is syn-
chronized. Currently both the Progressive Image Composition and Foveated
Vision plugins support the wireless user interface.
A common problem that may be seen with the wireless interface is that
the Metabuffer emulator may seem to stall. This usually happens when the
listener UDP server is not running and therefore no data is being fed into the
named pipe. The MPI process waits on the empty named pipe and grinds to
a halt. If this happens, check that listener is running before starting the
Metabuffer emulator and that it is located on the correct machine.
7.5 Conclusion
Wireless devices have intriguing possibilities as a user interface
medium. In the future, ideas can be taken from past research in ubiquitous
computing, augmented reality, and context-aware applications to provide addi-
tional data on the handhelds. In combination with previous mobile computing
ideas and techniques, using wireless devices to control visualization applica-
tions should result in a more powerful interface for the user.
Chapter 8
Progressive Image Composition Plugin
8.1 Introduction
Progressivity is a user interface technique well understood by most
computer users. Perhaps the most obvious use of progressivity is in World
Wide Web browsing. When a user navigates through a website, typically the
largest images are downloaded in stages. First the image arrives quickly in low
resolution form. As time allows, more data is then downloaded from the server
in order to create a high resolution image. Because of this, the user is able to
quickly navigate the site using the low resolution images as aids to find the
page he or she is trying to find. Once the user arrives at that page, the high
resolution images are downloaded while the user is studying the information.
This technique allows the user to be unimpeded while navigating the site, but
still provides high resolution imagery where and when it matters most.
The problem of how to quickly navigate but still retain high quality
image output also exists for rendering large data sets in parallel on multiple
displays. To achieve good user interactivity, an application must guarantee
time-critical rendering of the massive data stream. However, for the instance
of displaying a triangular mesh, though a good load balanced partition among
the parallel machines can be computed for a given user view point, new compu-
tation and data shuffling are required whenever the view point is significantly
changed. Either triangles may fall out of the viewport because of the move-
ment of the viewing direction or the viewport cannot cover all the polygons
assigned to it because of zooming. Redistributing primitives or imagelets in
order to render all of the polygons correctly takes time. If the user is simply
navigating the data set, this additional time will result in slower frame
rates, hampering user interactivity.
To solve this problem, we propose adapting the concept of progressivity
to the generation of images via image compositing on the Metabuffer, terming
the technique progressive image composition. The Metabuffer is a parallel,
multidisplay, multiresolution image compositing system [4]. To test the tech-
nique we are using the software emulator of the Metabuffer architecture [5].
By employing the Metabuffer’s multi-resolution feature, it is possible
to ensure the user will always have constant frame rates no matter what the
viewing angle or zoom factor. Instead of redistributing polygons or imagelets
while the user is rapidly changing views, a viewport can instead go to a lower
resolution and enlarge in order to accommodate the current polygons assigned
locally to the machine. When the user finally arrives at the view of interest and
stops changing viewpoints, frame rate is no longer a concern. At this point
polygons are redistributed in order to once again form completely high
resolution viewports. This paper shows that progressive image composition
helps to provide a good balance between user interactivity and frame rates
on the one hand and image quality on the other.
8.2 Background
The technique of progressivity has been studied by many research
groups for many different applications. Progressive transmission is used to
send information through a network, as with the case of the World Wide Web
for example. Progressive refinement is used for rendering images. Images may
first be created coarsely and then over time improved. Progressive image com-
position relies on a combination of both techniques in order to improve frame
rates.
8.2.1 Progressive Transmission
With the growth of the Internet, there has come a need for ways to
transmit large quantities of graphical information in varying levels of band-
width while still retaining a high degree of user interactivity. Because this
bandwidth can range from a slow analog modem up to a high speed fiber
optic connection, designing a web site to satisfy this requirement is difficult.
By using progressive transmission to regulate the bitstream of the data, it is
possible to satisfy both the slowest and fastest end user.
Shapiro [48] tells how wavelets can be used in image compression in
order to generate different bitstream rates. The essential idea is that the most
significant bits of the image are distributed first. This gives the end user a
basic idea of what is to come without downloading the entire picture. Over
time as the less significant bits are received, the image is refined.
Progressive image composition on the Metabuffer shares many of the
characteristics of progressive transmission. Time is the utmost concern in
progressive transmission. The user should be able to at least see a glimpse of
the output in the smallest amount of time by using a coarse representation.
Similarly, in progressive image composition, the goal is to hold frame rates
constant by using lower resolution (and thus lower bandwidth) versions of the
imagery to avoid lags due to network communication. This results in high user
interactivity even in the case of large data sets and relatively low rendering
resources.
8.2.2 Progressive Refinement
Progressive refinement is often used in radiosity in order to improve the
appearance of images over time. The more spare cycles that are available on
the machine, the more iterations can be spent adding to the detail of the final
picture. Forrest [16] shows how such an approach can improve antialiasing
results.
The Metabuffer’s use of progressive image composition is similar to how
progressive refinement has been used in preceding research. Imagery is first
computed in the quickest time possible, but over time computation can take
place to improve the quality of the final output. The Metabuffer achieves this
improvement by moving geometry primitives between the rendering machines
whenever the user keeps the view stationary in order to fit the primitives into
high resolution viewports. This is analogous to the computation that takes
place in a raytracing or radiosity application to further define the final image.
8.3 Implementation
There are three main steps in progressive image composition. First
the data set must be partitioned evenly across all of the parallel rendering
machines. In order to render very large data streams there cannot be a global
data set replicated on each machine. Second, for each frame the viewport
resolution and location must
be determined for each renderer. These rendered viewports are ultimately
composited by the Metabuffer and sent to the tiled display. Third, machines
are constantly determining how they can best adapt their viewports to the
current viewpoint and zoom factor by exchanging data in the background in
order to shrink the area covered by each renderer in image space and thus
create higher resolution imagery. These three steps are described below:
8.3.1 Initial Triangle Assignment
For the start of the visualization, the data set is distributed evenly
among all the rendering machines dependent upon the initial viewing param-
eters. The viewing parameters are important because the triangle partitions
assigned to each rendering server optimally should fit within a single high
resolution viewport. If the number of rendering servers is equal to the number
of displays and the resolution of the highest resolution viewport is equal to
the resolution of the tiles in the display this will always be possible.
Samanta [44] gives a variety of algorithms for solving this issue for a
sort first image compositing system. However, the Metabuffer, which is a
sort last image compositing network, allows more freedom in assigning image
space since viewports are allowed to overlap. In order to take advantage of
this additional flexibility and to obtain the best triangle distribution, a greedy
algorithm is currently used [5].
The greedy algorithm creates viewports by assigning the furthest tri-
angles from the center of mass first while attempting to retain viewport mo-
bility. In this context mobility means that assigning an additional triangle
for coverage to an already existing viewport will not limit its ability to shift
to accommodate additional triangles. By using this metric, the algorithm is
guaranteed to cover far flung polygons while still allowing for the best possible
load balancing of the bulk of triangles.
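A compact sketch of this greedy assignment is given below. The mobility
test is simplified to a single question (does the viewport still fit within
a high resolution extent after absorbing the triangle?); the actual
algorithm of Chapter 6 and [5] is more involved, and all names here are
illustrative.

#include <float.h>

typedef struct { float x, y; } Pt;            /* projected triangle centroid */
typedef struct {
    float minx, miny, maxx, maxy;             /* current extent              */
    int   count;                              /* triangles assigned          */
} Viewport;

/* Simplified mobility test: does the viewport still fit inside a high
   resolution w x h window after absorbing point p? */
static int fits(const Viewport *v, Pt p, float w, float h)
{
    float x0 = v->minx < p.x ? v->minx : p.x;
    float y0 = v->miny < p.y ? v->miny : p.y;
    float x1 = v->maxx > p.x ? v->maxx : p.x;
    float y1 = v->maxy > p.y ? v->maxy : p.y;
    return (x1 - x0) <= w && (y1 - y0) <= h;
}

/* tris[] must already be sorted by decreasing distance from the data
   set's center of mass.  Each viewport starts empty (minx = miny =
   FLT_MAX, maxx = maxy = -FLT_MAX).  owner[i] gets the viewport index. */
void greedy_assign(const Pt *tris, int n, Viewport *vp, int nvp,
                   float w, float h, int *owner)
{
    for (int i = 0; i < n; i++) {
        int best = -1;
        for (int j = 0; j < nvp; j++)    /* least loaded viewport that fits */
            if (fits(&vp[j], tris[i], w, h) &&
                (best < 0 || vp[j].count < vp[best].count))
                best = j;
        if (best < 0) best = 0;          /* none fits: forces lower resolution */
        Viewport *v = &vp[best];
        if (tris[i].x < v->minx) v->minx = tris[i].x;
        if (tris[i].y < v->miny) v->miny = tris[i].y;
        if (tris[i].x > v->maxx) v->maxx = tris[i].x;
        if (tris[i].y > v->maxy) v->maxy = tris[i].y;
        v->count++;
        owner[i] = best;
    }
}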
8.3.2 Viewport and Resolution Determination
With the triangles assigned, the next problem is how to determine
what parameters to use in order to make the individual images created by
the renderers blend with the rest of the composited display. This primarily
involves computing the viewport size and location for each renderer relative
to the overall viewing space.
To do this, the bounding box that OpenGL will use to rotate and
translate the renderer’s portion of the data set is set to the bounding box of
the data set as a whole. The coordinates of the bounding box for the renderer’s
portion of the data set are then computed in relation to this overall bounding
box. Thus, the bounding box of the triangles assigned to the renderer will be a
subset of the bounding box for the entire data set. By following the corners of
this subbounding box around the display space, the location of the renderer’s
polygons can be precisely tracked and measured.
At the beginning of each frame, a viewing frustum is created for the
entire tiled display using glFrustum(). The projection matrix obtained from
this call won’t be used to actually create an image. Rather, it will be used
to calculate the screen coordinates of the subbounding box corners for the
renderer’s portion of the data set. The modelview matrix is also obtained
after the proper rotations and translations of the object have occurred. Given
these two matrices, it is easy to determine where the eight corners of the subset
bounding box for the renderer’s portion of the triangles would lie on the overall
display space.
\begin{pmatrix} x_e \\ y_e \\ z_e \\ w_e \end{pmatrix} = M \begin{pmatrix} x_o \\ y_o \\ z_o \\ w_o \end{pmatrix} \qquad (8.1)
As shown in equation 8.1, each bounding box’s corner object coordinate
is first multiplied by the modelview matrix to correctly rotate, translate, and
scale it in order to compute the eye coordinate.
\begin{pmatrix} x_c \\ y_c \\ z_c \\ w_c \end{pmatrix} = P \begin{pmatrix} x_e \\ y_e \\ z_e \\ w_e \end{pmatrix} \qquad (8.2)
The eye coordinate is then multiplied by the overall projection matrix
in equation 8.2. This obtains the corner’s clip coordinate in the display space.
\begin{pmatrix} x_d \\ y_d \\ z_d \end{pmatrix} = \begin{pmatrix} x_c/w_c \\ y_c/w_c \\ z_c/w_c \end{pmatrix} \qquad (8.3)
Equation 8.3 then normalizes this into the device coordinate. A simple
scaling of the device coordinate yields the exact locations of the subbounding
box corners.
With the overall display coordinates in hand, the minimum and maxi-
mum x and y coordinates are found from the eight corners. These values are
then clipped by the boundaries of the overall display space. The extent of
the final x and y coordinates of the subbounding box determine the size and
therefore resolution of the viewport that will be needed. The values of the x
and y coordinates also determine the position of the viewport in the display
space.
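In practice, gluProject() performs exactly the chain of equations 8.1
through 8.3: the modelview multiply, the projection multiply, the
perspective divide, and the final scaling into display coordinates. A
sketch of the corner tracking using it follows; the names are illustrative,
and the viewport array is assumed to describe the overall tiled display
space rather than a single window.

#include <GL/gl.h>
#include <GL/glu.h>

/* Project the eight corners of a renderer's subbounding box into
   overall display coordinates and return their 2D extent. */
void subbox_extent(const GLdouble corners[8][3],
                   int display_w, int display_h,
                   double *xmin, double *ymin,
                   double *xmax, double *ymax)
{
    GLdouble model[16], proj[16];
    GLint view[4] = { 0, 0, display_w, display_h };  /* whole tiled wall */

    glGetDoublev(GL_MODELVIEW_MATRIX, model);   /* M in equation 8.1 */
    glGetDoublev(GL_PROJECTION_MATRIX, proj);   /* P in equation 8.2 */

    *xmin = *ymin =  1e30;
    *xmax = *ymax = -1e30;
    for (int i = 0; i < 8; i++) {
        GLdouble wx, wy, wz;
        gluProject(corners[i][0], corners[i][1], corners[i][2],
                   model, proj, view, &wx, &wy, &wz);
        if (wx < *xmin) *xmin = wx;
        if (wx > *xmax) *xmax = wx;
        if (wy < *ymin) *ymin = wy;
        if (wy > *ymax) *ymax = wy;
    }
    /* The caller then clips this extent to the display boundaries, as
       described above, to size and place the viewport. */
}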
Rendering the data given the viewport size and location usually requires
setting up an asymmetrical frustum. Though the user will be looking at the
entire display as a whole, a renderer will only be creating a subset of that
display. Thus, the frustum for this renderer must originate at the eye, but will
be off center depending on the location of the viewport and not perpendicular
to the projection plane (except in the case of a perfectly centered viewport in
the overall display of course). Figure 8.1 shows the projection issue for the
progressive image composition plugin. While the centerline of the overall view
is perpendicular to the projection plane, the centerline of the viewport view
is not. Creating a symmetric frustum for the viewport view will yield an
inaccurate rendering of that portion of the scene.
The issue of asymmetrical frustums is most often encountered when
rendering stereo images. Because our two eyes are slightly off center, yet both
look at the exact same area, the frustum has to be slightly off center and not
perpendicular to the projection plane in order to get the correct projection for
[Figure: the user's eye point with a symmetric frustum for the overall view and an asymmetric frustum for the viewport view, both meeting the display (projection plane)]
Figure 8.1: Asymmetrical frustum illustration
a stereo image. If this is not taken into account, “toe in” will result which
essentially distorts the resulting three dimensional effect. In the case of image
compositing, ignoring this projection problem will cause the
disparate images not to align correctly when composited.
Because most OpenGL implementations support stereo rendering, it
is very easy to establish an asymmetrical frustum. By taking the extents of
the viewport screen locations and mapping those back to the original frustum
values of the overall display space, it is possible to correctly determine the
frustum needed for each viewport.
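A sketch of that mapping is shown below, assuming the viewport extent is
given as fractions of the overall display; the names are illustrative.

#include <GL/gl.h>

/* Build the asymmetric frustum for one renderer's viewport.
   (l, r, b, t, n, f) are the glFrustum() parameters of the overall
   display; (vx0, vy0, vx1, vy1) is the viewport extent in normalized
   display coordinates, 0..1. */
void viewport_frustum(double l, double r, double b, double t,
                      double n, double f,
                      double vx0, double vy0, double vx1, double vy1)
{
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    /* Linearly map the viewport extent back onto the near plane of the
       overall frustum; the result is off center, i.e. asymmetric. */
    glFrustum(l + vx0 * (r - l), l + vx1 * (r - l),
              b + vy0 * (t - b), b + vy1 * (t - b),
              n, f);
    glMatrixMode(GL_MODELVIEW);
}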
8.3.3 Data Exchange
Over time, after the user has navigated the data set, it is very likely
that many of the Metabuffer viewports will need to shift to lower resolutions
in order to accommodate all of the triangles for which they are responsible.
Some kind of process, therefore, is needed to redistribute the triangles in order
to regain a collection of viewports that are all in the highest resolution for the
given viewpoint.
In order to do this, we propose a method based on progressive refine-
ment where excess cycles are used in the background to continually redistribute
polygons and shrink viewport sizes. While the user navigates, viewports may
need to shift to lower resolutions because all of their triangles do not fit within
the high resolution clipping area. However, the user will see no reduction in
frame rate because the partitions are still evenly load balanced and no com-
munication has had to take place. When the user sees something interesting
in the scene and starts to study it, the servers finally have time to redistribute
the blocks among themselves to try to reduce the size of the viewports and
increase resolution and therefore image detail.
To have a reasonable granularity for reshuffling triangles, we may break
the bounding box of the surface into hierarchical boxes [2], called blocks, and use the
block as the unit of distribution. At the heart of the scheme is a central server
that manages the allocation of those blocks. The central server keeps track of
block assignments to rendering servers and, given the view that the user has
chosen, determines which blocks, when transferred between two processors, will
help reduce viewport sizes for that particular view while preserving the load
balance property. It tells the rendering servers connected to the Metabuffer
which blocks to render and which blocks to send to other rendering servers.
The rendering servers themselves store only the blocks of triangles that
they are currently assigned. If the central server tells a rendering server to
ship a block, that server sends the block directly to the other server over
the network. The longer a user looks at a particular view, the more blocks
that can be transferred over the network between the servers. In essence,
the speed of the network used in the cluster affects only the speed of the
progressive improvements in resolution and not the speed that the user can
navigate through the data set.
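Since the scheme is only proposed here, the following is a speculative
sketch of the decision the central server might make on each background
step: choose the block whose transfer most shrinks the combined screen
extent of donor and recipient. The cost model and all names are
assumptions, not an actual implementation.

#define NSERVERS 10

typedef struct {
    float minx, maxx, miny, maxy;   /* screen extent for current view */
    int   owner;                    /* rendering server holding it    */
} Block;

/* Screen-space area of server s's viewport, pretending block 'skip'
   has left and block 'add' has arrived (-1 means "none"). */
static float extent(const Block *b, int nb, int s, int skip, int add)
{
    float x0 = 1e30f, x1 = -1e30f, y0 = 1e30f, y1 = -1e30f;
    for (int i = 0; i < nb; i++) {
        int mine = (b[i].owner == s && i != skip) || i == add;
        if (!mine) continue;
        if (b[i].minx < x0) x0 = b[i].minx;
        if (b[i].maxx > x1) x1 = b[i].maxx;
        if (b[i].miny < y0) y0 = b[i].miny;
        if (b[i].maxy > y1) y1 = b[i].maxy;
    }
    return (x1 > x0 && y1 > y0) ? (x1 - x0) * (y1 - y0) : 0.0f;
}

/* One background step: pick the single block transfer that most
   shrinks the combined viewport area of donor and recipient.  Load
   balance checks on per-server triangle counts are elided.  Returns
   the block to ship, or -1 if no transfer helps. */
int pick_transfer(const Block *b, int nb, int *from, int *to)
{
    int best = -1;
    float best_gain = 0.0f;
    for (int i = 0; i < nb; i++) {
        int s = b[i].owner;
        for (int d = 0; d < NSERVERS; d++) {
            if (d == s) continue;
            float before = extent(b, nb, s, -1, -1) + extent(b, nb, d, -1, -1);
            float after  = extent(b, nb, s,  i, -1) + extent(b, nb, d, -1,  i);
            if (before - after > best_gain) {
                best_gain = before - after;
                best = i; *from = s; *to = d;
            }
        }
    }
    return best;
}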
8.4 Results
The configuration used to test the progressive image composition plu-
gin consisted of 19 machines in our visualization cluster. Each machine was
equipped with a high performance Hercules Prophet II graphics card, 256 MB
of RAM, an 800 MHz Pentium III processor and ran the Linux operating sys-
tem. Nine of the machines were set to actually emulate the Metabuffer hardware.
They performed the image compositing and output of the 3 by 3 tiled display
space. The other 10 machines were tasked with actually rendering the scenes.
All 19 machines were connected via 100 Mbps Fast Ethernet. We lim-
ited the test to 19 machines instead of the full 32 in the cluster with graphics
cards because the higher amounts of data transfer exceeded the capabilities
of the network and significantly slowed emulator performance. We anticipate
that the addition of Compaq’s ServerNet II to the cluster will greatly reduce
this constraint. The actual Metabuffer design, when put into hardware form,
eliminates this overhead entirely.
Dataset          Size (triangles)   Viewport assignment   Render per frame
Oceanographic    392,332            6.6 seconds           0.03 seconds
Santa Barbara    6,163,390          88.6 seconds          0.44 seconds
Visible Human    9,128,798          135.36 seconds        0.78 seconds
Table 8.1: Progressive data set information
Three data sets of different sizes were used to demonstrate the perfor-
mance of the progressive image composition plugin for the Metabuffer emu-
lator. Table 8.1 gives the size of each data set, the time it took to initially
precompute the triangle to viewport assignments using the greedy algorithm,
and finally the average time needed to render each frame of the 720 frame
movies presented in this report. As will be shown in the following graphs, the
per frame timings are constant irrespective of the user’s viewpoint, so these
averages essentially tell the frame rate for the entire movie.
Also, even though the data sets used were of increasing size, the number
of rendering machines was kept constant at 10. This means that in the case
of the visible human, frame rates were slower than what would be needed for
a real time display. Including more machines as renderers would reduce the
workload of each machine and lower the rendering time to real time 30 frames
per second rates. The only penalty imposed by the Metabuffer hardware for
scaling up to more renderers is a few pixels of latency per machine with no
drop in throughput.
8.4.1 Oceanographic
The oceanographic data set is an isosurface generated by Zhang [53]
consisting of 392,332 triangles. It shows the topography of the ocean floor.
Dividing the data set into 10 load balanced viewports yielded 39,233 triangles
per renderer.
To demonstrate that the frame rates do not change regardless of the
user’s viewpoint a 720 frame movie was generated in which the data set was
zoomed in and zoomed out while constantly being rotated. A sample of the
frames taken throughout the movie is included in figure 8.2.
At the beginning of the movie, the image is cleaved into the 9 tiles that
form the 3 by 3 tiled display space. During the movie, these tiles are rejoined
to show the overall display, and then separated again at the end to reinforce
the fact that the Metabuffer is acting on a multitiled display space.
The black boxes visible in the frames show the viewport locations. As
the data set is zoomed in and out, it is readily apparent when the viewports
shift from high resolution to low resolution by the sizes of these black boxes.
[Figure: frames 3, 79, 155, 235, 360, 422, 461, 531, 605, and 707 of the oceanographic movie]
Figure 8.2: Sample frames from the oceanographic movie
Initially, the individual viewports belonging to each renderer are cycled
around in a circle to demonstrate that they can be located anywhere within the
global display space and are indeed disparate. Each viewport is color coded
according to the renderer that drew it.
Later, the viewports are composited together to form the data set.
The user zooms in while rotating the scene. As this is occurring, viewports
dynamically move and resize themselves to adjust to the expanding extent
they must cover to render all of their triangles. Finally, the user zooms out
and the viewports shrink.
[Plot: per-frame rendering time in seconds versus frame number (0 to 720) for Renderer0 through Renderer9]
Figure 8.3: Rendering times for oceanographic movie frames
Figure 8.3 gives the timings for the oceanographic movie
throughout all 720 frames. For comparison with the other data sets they are
scaled from 0 to 0.85 seconds. Note that the timings for each frame are almost
completely flat. No communication has to occur between frames, and this lack
of overhead means that the user sees no drop in interactivity regardless of how
the data set is viewed. From the graph, it is evident that the renderers are all
reasonably load balanced.
8.4.2 Santa Barbara
The Santa Barbara data set is an isosurface taken of the gravity fields
for a galaxy. This data set is almost 16 times larger than the oceanographic
one shown previously.
Figure 8.4 shows some sample frames from the 720 frame movie. Just
as with the oceanographic example, the viewports are first circled to show that
they are distinct. Afterwards the data set is zoomed in and zoomed out while
constantly being rotated. Again, each viewport is color coded according to
the renderer that drew it.
The graph in figure 8.5 reveals timing results similar to that of the
oceanographic example. Again, they are flat, owing to the lack of interframe
communication needs. The viewports are also relatively well load balanced
resulting in efficient use of all 10 renderers.
[Figure: frames 3, 89, 160, 240, 325, 362, 474, 546, 617, and 715 of the Santa Barbara movie]
Figure 8.4: Sample frames from the Santa Barbara movie
[Plot: per-frame rendering time in seconds versus frame number (0 to 720) for Renderer0 through Renderer9]
Figure 8.5: Rendering times of Santa Barbara movie frames
8.4.3 Visible Human
The final sample data set is an isosurface taken from the visible human
model. This data set is more than 23 times larger than the oceanographic
example.
Figure 8.6 reveals sample frames from the 720 frame movie. The view-
ports circle and are then composited together to form the overall display. Again
the data set is zoomed in and zoomed out while being constantly rotated. As
with the other two examples, each viewport is color coded according to the
renderer that drew it.
Figure 8.7 shows the timings of the movie. Just as with the previous
two, they are flat resulting in constant frame rates for the user and good
interactivity. However, these timings range from 0.43 seconds all the way to
0.78 seconds. The viewports that were created all had an equal number of
polygons assigned to them. But, in some cases, the number of polygons is not
an accurate representation of the rendering load. It is obvious that in this
particular case, some other metric will need to be used to load balance the
data set evenly.
All of the frames for these movies were created for a 3 by 3 display to
facilitate an easier presentation of them for this article. In reality the cluster
hosting the Metabuffer is connected to a 5 by 2 tiled display space in our
visualization laboratory. Typically 10 machines are used to do the Metabuffer
emulation, each responsible for driving one of the displays. The composited
visible human is pictured in figure 8.8 from our visualization laboratory during
an emulator run.
[Figure: frames 2, 82, 160, 236, 316, 365, 431, 543, 646, and 702 of the visible human movie]
Figure 8.6: Sample frames from the visible human movie
[Plot: per-frame rendering time in seconds versus frame number (0 to 720) for Renderer0 through Renderer9]
Figure 8.7: Rendering times for visible human movie frames
Figure 8.8: Composited visible human in visualization lab
8.5 Conclusion
Because of the pipelined design of the Metabuffer, more machines than
the 10 used here in these experiments could be harnessed. The only penalty
would be an increase in latency, and this increase would be measured in pixels–
a small tradeoff. Given the resources, there is no limit to how many machines
can be added and thus how many times the triangles can be divided into
smaller and smaller viewports. A target of 30 frames per second with a large
cluster is feasible for data sets of the sizes presented in this report.
In the future, we will explore methods to redistribute the polygons to
shrink viewport sizes and increase resolution interactively based on the client
server framework illustrated in this report. This processing can take place in
the background while the user is studying a scene. We do not anticipate that
this data movement will affect frame timings in any way. Rather, it will simply
increase the resolution of the scene via progressive refinement.
The application of progressive image composition with the Metabuffer
shows how the Metabuffer architecture can assist in improving load balancing
and user interactivity while still achieving high quality output images. Pro-
gressive display is a common feature in many computing applications (most
notably web browsing) and has been accepted by users as an adequate way to
present data in order to gain interactivity. The ability to provide fast frame
rates for data set navigation while still allowing for high-resolution output im-
ages at arbitrary viewpoints provides a good balance between speed and image
quality.
Chapter 9
Foveated Vision Plugin
9.1 Introduction
Our own eyes can only sense detail directly where we look. Objects in
our peripheral vision appear in low resolution and lack definition. This basic
biological fact is a result of the concentration of rods and cones in the retina
of the human eye. A higher concentration exists at the center with the density
gradually becoming lower and lower towards the edges. In fact, the human
eye even has a blind spot where nerves exit the eyeball and there are no rods
or cones at all. Our brain processes what we are seeing in order to account
for the blind spot and differing rod and cone densities. As a consequence
of human biology, even though computer visualization systems may render a
large display in high resolution, by the time that information gets to our brain,
much of the information has been lost by the limitations of our vision system.
Because visualization displays and data sets are becoming larger, this
fact has important consequences. Already cave type virtual reality labs employ
multiple projectors for an immense immersive display. By tiling the higher
resolution projectors or panels available today, creating enormous displays
with billions of pixels is practical. IBM, for example, currently has a 3000 by
3000 pixel LCD panel consisting of 9 million pixels. Creating an 11 by 11 grid
of those panels would result in a display consisting of over one billion pixels.
Rendering such displays in high resolution to visualize extremely large data
sets uses a tremendous amount of computing resources, takes a large amount of
time, and thus results in slow frame rates. This is despite the fact that, because
of our limited vision systems, much of the display either won't be seen at all
(because it is behind us in a cave arrangement) or will be seen only in the
periphery in low resolution.
This is the concept for the foveated vision application for the Metabuf-
fer. The Metabuffer is a parallel, multidisplay, multiresolution image composit-
ing system [4]. Using the physical characteristics of the eye as an advantage,
the computing resources of the Metabuffer are matched to the areas in the
display that are being examined. The majority of the rendering servers con-
centrate their work where the user is gazing. In this manner a high resolution
image is generated quickly exactly where the user is focused. The periphery
of the display is rendered in lower and lower levels of resolution and detail
corresponding to the rod/cone concentration in the human eye. This allows
only a few renderers to be used to create the entire periphery of what could be
a building-sized display. To test the procedure we are using the software em-
ulator [5] of the Metabuffer architecture. This paper shows that the foveated
vision technique on a parallel, multidisplay, multiresolution image composi-
tion system concentrates rendering power where it is needed, helping to lower
computation cost and resulting in high frame rates and good user interactivity.
9.2 Background
Using the foveation of the human visual system as an advantage is
nothing new. Several research groups have tackled problems such as image
transmission and image processing by using the low resolution areas of the eye
as an asset.
9.2.1 Image Processing
One problem that benefits from foveated techniques is image processing.
Image processing is often a very computationally intensive task. Every pixel in
an image must have calculations performed on it to perform pattern matching,
edge detection, or other operations.
Many times, though, this image processing is being done to simulate
what a normal human eye would be seeing. Facial recognition is one such
example. The human eye lacks detail in its peripheral view. Therefore, the
brain does not have to process nearly as much information from the edges of
the view as it does in the center. This hindrance actually helps the brain by
preventing an overload of visual stimulus.
Researchers have taken advantage of this fact by using methods to avoid
processing the enormous quantities of high resolution pixels in the periphery.
After all, since the brain does not have to deal with these peripheral pixels,
neither should the computer. Special foveated CCD cameras have been de-
veloped which record high resolution only at the center of the gaze in order
to lessen the information overload resulting from taking in imagery from high
resolution cameras which sense all areas equally [13, 52, 41]. Image processing
applications can then take advantage of this reduced imagery to concentrate
their algorithms on the center of the scene, rather than the edges of the gaze,
just as the brain does in conjunction with the human eye.
9.2.2 Image Transmission
Another problem which has used foveated vision is image transmis-
sion. Full motion video can require large amounts of data to be transmitted.
Usually the amount of bandwidth available is the limiting factor facing this
transmission. Any technique that lessens the need for data will greatly help
the image transmission problem. Since the peripheral vision of the human eye
cannot see high resolution imagery, it makes little sense to have to transmit
this peripheral image data that eventually will not even be processed by our
vision system.
This is the technique used by Geisler [19]. His research applies foveated
techniques to MPEG encoding. Essentially the MPEG stream is recorded
at successive levels of resolution. By recording the user’s gaze, a “foveated
pyramid” is created with high resolution imagery in the center which becomes
successively lower the farther the imagery happens to be from the user’s gaze.
Geisler reports that with foveated techniques the MPEG bandwidth
requirements dropped by a factor of three. He also states that if bandwidth is
kept constant, frame rates could instead increase by a factor of three. Finally
Geisler remarks that foveated techniques could easily be applied to image
generation by using low resolution and reduced levels of detail. These very
techniques will be exploited in the Metabuffer plugin.
9.2.3 Image Generation
Although not specifically tied to applications involving eye tracking,
several research groups have studied using multiresolution to speed up image
generation. Hoppe [23] illustrates how progressive meshes can be used to
significantly increase performance when rendering large data sets. He shows
how different levels of detail can be used depending on whether the data is
close to or far away from the user. Shamir [47] reveals how to use DAGs in order to
efficiently create multiresolution meshes on time varying deforming meshes.
Magillo [32] presents a library in order to model multiresolution meshes. Saito
[43] discusses how to use wavelets to compactly encode and efficiently retrieve
hierarchical multiresolution representations of objects.
Progressive meshes will be used by the Metabuffer foveated vision ap-
plication to present level of detail views of the scene to the user based on gaze
location. Currently, the progressive meshes used by the Metabuffer foveated
vision application do not use wavelet compression, but this method could serve
to compress source data further to better handle large data sets.
9.3 Implementation
[Plot: "Visual Acuity Across the Retina"; relative acuity (0 to 1.0) versus degrees from the fovea, with the blind spot marked]
Figure 9.1: Coren’s acuity graph
Acuity is the term used to describe the eye’s ability to resolve detail.
Typically, this measurement is expressed as an angle corresponding to the
smallest span the eye can identify. As shown in figure 9.1 by Coren [9], acuity
changes as a function of the distance away from the center of the eye. This
is due to the concentration of rods and cones in the retina. The highest
concentration exists at the center of the eye in the fovea, with the density
becoming less and less towards the periphery. A blind spot exists where the
optic nerve exits the eyeball.
Coren’s graph reveals that the drop off in acuity, and thus resolution, in
the eye is exponential. In fact, within 10 degrees it drops by almost 80 percent.
By matching the rendering resources of the computer graphics system to this
acuity graph, the rendering power of the system can be concentrated mainly
in the areas where it is needed most–the center of the user’s gaze. Only a
small portion of the system is needed to generate the low level of detail and
resolution towards the periphery.
A foveated vision system can be designed using Coren’s graph either
via the continuous method or the discrete method. The discrete method using
the hardware capabilities of the Metabuffer will be covered in this paper.
9.3.1 Continuous Method
With the continuous method, level of detail and resolution is matched
directly to Coren’s acuity graph. By using a wavelet encoded mesh, it is
possible to finely adjust the complexity of the scene. Depending upon the
distance from the center of the user’s view, an error value corresponding to
Coren’s graph can be used to walk through the wavelet encoding in order to
obtain the proper amount of detail for every area in the scene. Likewise, this
same error value can be used to adjust the level of resolution used to generate
the scene. Higher error values would allow lower levels of resolution. A very
similar method employing hierarchical bounding boxes [2] could also be used.
In either case, delays resulting from data locality issues could be quelled by
utilizing progressivity. As with progressive image composition [3], switching
to lower resolution viewports would allow renderers to cover all the polygons
they are responsible for drawing while still keeping the frame rate high. Over
time polygons can dynamically be moved to achieve the high resolution output
imagery.
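As a toy model of the continuous method, the exponential falloff can be
approximated and used directly as the error tolerance. The constant below
is fitted only to the figure quoted earlier (acuity dropping by almost 80
percent within 10 degrees), so it is an assumption rather than Coren's
measured curve, and all names are illustrative.

#include <math.h>

/* Toy acuity model: exponential falloff with eccentricity, fitted so
   acuity drops to about 20% at 10 degrees from the fovea.
   k = ln(5)/10, approximately 0.161. */
double relative_acuity(double degrees_from_fovea)
{
    return exp(-0.161 * fabs(degrees_from_fovea));
}

/* Error tolerance for the wavelet-encoded mesh: the lower the acuity
   at a sample's eccentricity, the larger the geometric error allowed.
   max_error is an application-chosen bound (illustrative). */
double error_tolerance(double degrees_from_fovea, double max_error)
{
    return max_error * (1.0 - relative_acuity(degrees_from_fovea));
}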
Of course, other metrics besides Coren’s acuity graph could be used
to direct the resolution and level of detail. In these cases, the foveated vision
system dealing strictly with user gazes can instead be generalized into a region
of interest (ROI) application. This region could be controlled via user input
from a wireless mouse or other input device instead of merely being taken from
gaze tracking hardware. The region of interest could also be modified by past
user history–keeping previous areas of interest in focus. Another characteristic
that could modify resolution and level of detail is prominent features in the
data set. Algorithms could detect high frequency changes in the data set and
bring those areas into closer focus since they could yield interesting informa-
tion. Distance from the user is also a trait that could be used to influence the
level of detail in a scene such as is done in Hoppe’s work [23].
9.3.2 Discrete Method
Applying the discrete method to the Metabuffer hardware makes sense
since the Metabuffer is able to generate viewports only in integer increments
of different resolution. Because of this limitation, instead of using Coren’s
complete graph as a cue for the level of resolution, individual points on
that graph are taken for each Metabuffer viewport resolution multiple. These
individual points are used to precompute a hierarchical mesh of the model to
be used in generating the scene.
For example, in the case shown in figure 9.2, the foveated vision ap-
plication using the Metabuffer employs three differently sized viewports. The
smallest viewport contains the highest resolution and is centered at the user’s
focus. This area corresponds to the peak in Coren’s acuity graph and will be
assigned the highest level of detail data set. The next larger viewport imple-
mented by the Metabuffer is in medium resolution. To find the level of detail
for this area, the highest acuity level covered by this area in Coren’s graph is
used. In this case, it would be about 20% of the detail of the high resolution
data set. Likewise, the largest and lowest resolution viewport implemented
by the Metabuffer uses a level of detail of approximately 10%, according to
Coren’s graph.
With polygon counts in the medium and low resolution viewports run-
ning 20% and 10% of the polygon counts in the high resolution viewport, it is
possible to match the greater number of rendering servers to the area of the
user’s focus. Using a cluster of rendering servers, 77% of these can be assigned
to generate the imagery for the high resolution high level of detail viewport.
Because the medium resolution viewport consists of only 20% of the polygons
as the high resolution viewport, only 15% of the machines are needed to ren-
der this area. Finally, since the lowest resolution consists of only 10% of the
polygons, only 8% of the machines are necessary to render the entire region in
a load balanced manner.
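The arithmetic behind these percentages is simply each tier's share of the
total polygon load. With relative polygon counts of 1, 0.2, and 0.1, the
total is 1.3, so

$\frac{1}{1.3} \approx 77\%, \qquad \frac{0.2}{1.3} \approx 15\%, \qquad \frac{0.1}{1.3} \approx 8\%.$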
9.3.3 Load Balancing
The main problem in creating a foveated vision plugin for the Meta-
buffer system is how best to utilize the rendering resources available. They
should be organized in order to achieve the best degree of efficiency and the
fastest frame rates. The multiple parallel rendering machines need to be load
balanced no matter what viewpoint the user chooses. This organizational
problem is presented formally as follows.
Conditions
1. There exists a screen of n tiles and m rendering servers (m ≥ n). Each
tile has the same size of w × h pixels and each server has the same
rendering capability c triangles/second.
2. There are p triangles that project into the screen. We assume each
triangle takes the same amount of time to render.
Constraints
1. A high resolution w × h pixel area must be rendered where the user(s)
are gazing at all times. Regions surrounding this area can be rendered
in diminishing level of detail and resolution corresponding to the drop
off in rod and cone concentration in the peripheral view of the eye.
2. The data set could be extremely large, and thus all p triangles along with
the varying levels of detail of this triangle set must be evenly distributed
across all the machines. There cannot be a global data set that resides
on each machine.
3. The frame rate should be at least on par with the $p/(m \times c)$ time possible
with the progressive image composition method. Taking into account the
diminished triangle count from decimated data sets, this means that the
rendering machines need to be fairly load balanced for any user viewpoint
even if the data set is almost certainly heterogeneously distributed across
the scene.
The goal is not only to find the best assignment of levels of detail of
data to renderers but also the best match of renderers to display space such
that the display is rendered in the shortest time.
In order to solve this problem, the multiresolution features of the Meta-
buffer will be used extensively. In the case of a single user, viewports are
arranged in a configuration analogous to Geisler's “foveated pyramid”. Fig-
ure 9.2 shows the “foveated pyramid” for the visible human example in this
paper. High resolution viewports are located at the center of the user's gaze.
Successively lower resolution viewports radiate out until the lowest resolution
viewport fills the entire display.
[Figure: the foveated pyramid for the visible human. High resolution: 7 renderers, 9,124,090 polygons (1,303,441 each); medium resolution: 2 renderers, 1,060,106 polygons (530,053 each); low resolution: 1 renderer, 241,988 polygons]
Figure 9.2: Foveated pyramid for visible human example
The ability to concentrate the rendering power of the Metabuffer in
the area of the user’s gaze is possible because of progressive meshes that have
been created by decimating data sets. The large low resolution viewports in
the periphery are required to render a much greater area that would normally
consist of a large amount of polygons. By using decimated data sets, however,
the quantity of polygons in this area can be much less than the number of
polygons contained in the small high resolution viewport. Therefore, a small
number of rendering servers can adequately render a larger area.
Ensuring that the rendering servers are load balanced despite the user’s
viewpoint is achieved by assigning the triangles belonging to each progressive
mesh modulo the number of processors assigned to that mesh. This means
that the polygons for the data set are evenly distributed spatially among all
the processors. No matter where the user looks, all the processors will be
responsible for an even number of polygons. This is the technique used by
PixelFlow [12] to load balance its custom hardware even when dealing with
nonhomogeneous data sets.
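A sketch of this modulo assignment, with illustrative names:

/* Assign triangle i of a progressive mesh to one of the k rendering
   servers dedicated to that mesh, round-robin.  Because consecutive
   triangles are usually spatial neighbors, every server receives a
   spatially interleaved, roughly equal share, so any viewpoint sees a
   balanced load (the PixelFlow-style distribution cited above). */
int server_for_triangle(int i, int first_server, int k)
{
    return first_server + (i % k);
}

For instance, in the visible human example later in this chapter, the high
resolution mesh might use first_server = 0 and k = 7, the medium mesh
first_server = 7 and k = 2, and the low mesh first_server = 9 and k = 1.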
9.3.4 Compositing
In order to merge the layers of multiresolution imagery together and
simulate the “foveated pyramid” using the Metabuffer, it is necessary to en-
sure that the higher level resolution imagery always takes precedence over
lower level resolution imagery. To do this, lower resolution rendering servers
remove portions of their viewports that will be covered by higher resolution
imagery using the hardware stencil buffer. With most of today’s graphics
cards, including the GeForce2 boards in our cluster, stencil tests are always
performed when doing Z buffer comparisons. Thus, the use of a stencil buffer
is essentially free in terms of performance cost. With the areas not covered by
the stencil vacant, pixels from high resolution renderers are free to be compos-
ited over these areas. This effectively performs a painter's algorithm operation
using the existing architecture of the Metabuffer.
To allow for continuity, neighboring viewports of different resolutions
are allowed to overlap slightly. In these areas of overlap, dithering patterns are
applied. Again, this is done using the stencil buffer. Checkerboard patterns
are applied at the edges of the higher resolution viewport. By pushing the far
and near clipping planes slightly farther back for the neighboring low resolution
area, the border area between the two viewports consists of half higher and
half lower resolution data, but with a checkerboard mesh that is of the higher
resolution. This screen door transparency technique effectively smooths the
output image at the transitions between the higher and lower resolutions.
Blending this area masks discontinuities in the progressive meshes and in the
resolution changes.
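Below is a sketch of the stencil setup a lower resolution renderer might
use: first mark the screen rectangle that a higher resolution viewport will
cover, then render with the stencil test rejecting those pixels so they
stay vacant for the high resolution imagery. The framebuffer must have
stencil planes, and the function and parameter names are illustrative, not
the plugin's actual code.

#include <GL/gl.h>

void mask_high_res_region(float x0, float y0, float x1, float y1)
{
    glEnable(GL_STENCIL_TEST);
    glClearStencil(0);
    glClear(GL_STENCIL_BUFFER_BIT);

    /* Write 1s into the stencil over the covered rectangle, without
       touching color or depth.  An orthographic projection matching
       screen coordinates is assumed to be current. */
    glStencilFunc(GL_ALWAYS, 1, 1);
    glStencilOp(GL_REPLACE, GL_REPLACE, GL_REPLACE);
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_FALSE);
    glRectf(x0, y0, x1, y1);
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_TRUE);

    /* Scene rendering now proceeds only where the stencil is 0; a
       checkerboard pattern could be stenciled along the border in the
       same way to get the screen door blending described above. */
    glStencilFunc(GL_EQUAL, 0, 1);
    glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);
}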
9.3.5 Tracking
Tracking the movement of the retina typically is done using head mounted
displays with CCD cameras aimed into the eye. Until such a system is installed
in the visualization lab, the research in this paper uses a wireless visualization
device implemented using Compaq iPAQs running Windows CE and wireless
Ethernet to allow the user to input gaze areas and rotate and zoom the model.
9.4 Results
The configuration used to test the foveated vision plugin consisted
of 19 machines in our visualization cluster. Each machine was
equipped with a high performance Hercules Prophet II graphics card, 256 MB
of RAM, an 800 MHz Pentium III processor and ran the Linux operating sys-
tem. Nine of the machines were set to actually emulate the Metabuffer hardware.
They performed the image compositing and output of the 3 by 3 tiled display
space. The other 10 machines were tasked with actually rendering the scenes.
All 19 machines were connected via 100 Mbps Fast Ethernet. We lim-
ited the test to 19 machines instead of the full 32 in the cluster with graphics
cards because the higher amounts of data transfer exceeded the capabilities
of the network and significantly slowed emulator performance. We anticipate
that the addition of Compaq’s ServerNet II to the cluster will greatly reduce
this constraint. The actual Metabuffer design, when put into hardware form,
eliminates this overhead entirely.
Three data sets are used to demonstrate the capabilities of the foveated
vision plugin for the Metabuffer: an isosurface of an engine block, a skeletal iso-
surface of the visible human, and an epidermal isosurface of the visible human.
All contain progressive meshes generated by the fast isosurface extraction
system developed by Zhang [53].
Dataset          Size (triangles)   Viewport assignment   Render per frame
Engine           617,910            N/A                   0.02 seconds
Skeleton         6,352,801          N/A                   0.57 seconds
Visible Human    9,128,798          N/A                   0.81 seconds
Table 9.1: Foveated data set information
The statistics for each are shown in table 9.1. Because we are doing
a simple even division of the data set among the processors, the time needed
to assign triangles to viewports does not apply to the foveated vision plugin.
The render timings for each data set reflect the average time needed to compute
each frame in a 720 frame movie in which the model is zoomed and rotated.
As the graphs below show, the foveated vision plugin provides constant frame
rates no matter what the viewpoint, so these average timings are in fact the
frame times for any point in the movie.
Decimated data sets coupled with variable sized viewports mean that
rendering servers can be concentrated at the user's gaze. In the example
presented in this chapter, with a 3 by 3 tiled display and 10 renderers, 7 renderers
handle the high resolution viewing area, 2 handle the next larger area, and
1 works with the lowest resolution viewport covering the entire display. The
data set with the highest level of detail is divided evenly among the 7 machines.
The middle level of detail data set is divided among the 2. Finally, the lowest
level of detail data set is given to the one machine responsible for the
extreme periphery. This even division means that large data sets can easily
be used in the Metabuffer system: the large amount of memory on the cluster
as a whole is used collectively to store the polygons.
9.4.1 Visible Human
In the case of the visible human data set, the highest resolution mesh
consists of 9,124,090 polygons. The medium resolution mesh consists of 1,060,106
polygons. Finally the lowest resolution mesh has only 241,988 polygons. Given
the processor assignments from above with the polygon counts from the pro-
gressive meshes of the visible human generated by the isosurface extraction,
the high resolution mesh is divided among 7 rendering servers resulting in
1,303,441 polygons per server. The medium resolution mesh is divided be-
tween 2 rendering servers giving 530,053 polygons per server. The low res-
olution mesh is assigned to one rendering server which is responsible for all
241,988 polygons. At first it may seem that these assignments are imbalanced,
but it is important to remember that, because the high resolution imagery will
only be drawn for one area of the display, not all of the polygons assigned to
the high resolution renderers will need to be drawn. This is true to a lesser
degree for the medium resolution polygons too. Because the polygons for all
the servers are distributed evenly across object space, different viewpoints or
zooms should not affect loading.
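A minimal sketch of this even division, assuming an illustrative Triangle type (the metascatter tool described in appendix B performs the equivalent modulo split offline):

// Triangle i goes to server i mod n, so every server's share samples the
// whole object uniformly and no viewpoint or zoom favors one server.
#include <vector>

std::vector<std::vector<Triangle> >
partitionModulo(const std::vector<Triangle> &mesh, int numServers)
{
    std::vector<std::vector<Triangle> > parts(numServers);
    for (size_t i = 0; i < mesh.size(); i++)
        parts[i % numServers].push_back(mesh[i]);
    return parts;
}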
The images in figure 9.3 show 10 of the frames from a 720 frame movie.
At the beginning and end of the movie, the nine separate screens in the tiled
display split apart to reveal the geometry of the overall scene. In the middle
of the movie they join together to show how the unified display would look.
During the movie, the visible human data set is moved through a zoom
in and zoom out while being continually rotated. Meanwhile, the user’s gaze is
being tracked and that area is rendered in high resolution no matter what the
viewpoint. The user is not restricted in where he or she may look. Anywhere
in the entire display space is a valid place for the high resolution viewport.
Polygons are color coded according to which rendering server created
them. This gives the imagery within the high resolution viewport a mottled
appearance, since 7 rendering machines are responsible for this area. The
medium resolution viewport, on the other hand, only has two colors from the
two renderers that are assigned to it. Finally, the low resolution viewport is
being rendered by only one machine and thus is a solid green.
Figure 9.3: Sample frames from the visible human movie (frames 4, 119, 255, 352, 360, 367, 452, 557, 643, 715)

Notice that the display decreases in resolution and complexity according
to the “foveated pyramid” of multiresolution viewports, which are marked as
black rectangles. The level of detail differences in the progressive meshes
and the resolution differences are most noticeable in the zoomed in views.
For example, in these views the fine detail of the lower torso of the human
inside the high resolution viewport contrasts with the less detailed rendering
of the leg in the low resolution viewport.
Figure 9.4: Rendering times for visible human movie frames (seconds per frame versus frame number, Renderer0 through Renderer9)
Timings from the movie are shown in figure 9.4. Because the polygons
are distributed evenly across the scene between the processors, all the timing
lines from the 720 frame movie are flat. No matter where the user looks or how
much he or she zooms into the scene, the load will always be the same. Because
a parallel application is only as fast as its slowest component, the frame rate
for this example using 10 rendering machines would be 0.81 seconds per frame.
However, because of the scalable nature of the Metabuffer architecture, adding
rendering machines only results in additional pixels worth of latency and does
not affect throughput. By applying 100 machines to render the same example,
each machine's share of the data set would shrink by a factor of 10, and so
would the rendering times. Still more rendering machines would yield similar
increases in frame rate.
9.4.2 Engine
For the engine data set, the highest resolution mesh consists of 617,910
polygons. The medium resolution mesh has 46,082 polygons and the lowest
resolution mesh consists of 10,728 polygons. With the processor configuration
described above, this means that the high resolution mesh is partitioned into
units of 88,273 polygons, the medium resolution mesh is divided into units of
23,041 polygons, and the low resolution mesh is assigned to one processor
responsible for all 10,728 polygons. Again, the polygon distributions are not even across
the resolution groups, but the majority of renderers (those rendering the high
resolution area) are completely balanced in terms of polygon count. Those high
resolution renderers will be the determining factor in frame timings, since they
are responsible for the largest polygon counts. Thus, the minority of renderers
should not adversely affect either the timings or the efficiency of the system.
Figure 9.5: Sample frames from the engine movie (frames 2, 78, 145, 213, 286, 360, 439, 516, 596, 677)
Figure 9.5 shows 10 of the frames from the 720 frame movie created
using the engine data set. Just as with the visible human example, the data
set is zoomed in and out while constantly being rotated. The region of interest
controlled by the user is constantly in high resolution with the periphery falling
off in detail according to Coren’s model and Geisler’s “foveated pyramid”.
Again, each viewport is color coded according to the renderer that drew it.
Figure 9.6: Rendering times for engine movie frames (seconds per frame versus frame number, Renderer0 through Renderer9)
The timings for the engine movie, set to the same scale as the visible
human movie for comparison, are shown in figure 9.6. Again, just as with
the visible human, the timings are flat no matter what viewpoint or region of
interest is chosen. In the case of the engine model, the 10 machines used in the
rendering of the frames are more than enough to create 30 frames per second.
Again, if this were not the case, the Metabuffer architecture is easily scalable
to allow for more rendering machines which will subdivide the polygon count
further and allow for faster frame times.
9.4.3 Skeleton
With the skeletal data set, the foveated vision plugin for the Metabuffer
behaves just as in the previous two examples. The skeletal data set consisted of 6,352,801
polygons in the high resolution viewport split over 7 processors resulting in
907,543 polygons per processor. For the medium resolution level of detail,
there were 664,528 polygons split over two processors giving 332,264 polygons
per processor. Finally, in the lowest resolution level of detail there were only
138,594 polygons assigned to a single machine.
Figure 9.7 shows the frames from the movie made from the skeletal
data set. Again, the model is zoomed in and zoomed out while being rotated.
The foveated area is moved around the screen, revealing a constant area of
high resolution. The rest of the display falls off in resolution as prescribed by
the “foveated pyramid”. As with the other two examples, each viewport is
color coded according to the renderer that drew it.
Figure 9.7: Sample frames from the skeleton movie (frames 3, 100, 199, 300, 341, 384, 481, 591, 685, 718)

The timings for the skeletal data set shown in figure 9.8 mirror the
results of the other two examples. All timings are flat regardless of the frame
number. The majority of renderers are balanced and grouped in the highest
timing line. The minority of renderers responsible for the medium and low
resolution areas of the screen fall in the second and third highest lines
respectively.

Figure 9.8: Rendering times for skeleton movie frames (seconds per frame versus frame number, Renderer0 through Renderer9)
One possible criticism of the technique presented in this chapter is
that not all of the rendering machines are completely load balanced. While
this is not very obvious in the case of the engine model, from the graph of
the visible human in figure 9.4 it is evident that the timing lines are clumped
into three groupings. The first, at 0.81 seconds per frame, are the 7 renderers
that are doing the high resolution viewport. The second, at 0.27 seconds per
frame, are the 2 renderers doing the medium resolution viewport. Finally, at
0.11 seconds per frame is the single renderer responsible for the low resolution
viewport. While the renderers in each of these groups are load balanced among
themselves, as a whole they are not evenly balanced.
This should not be a concern. The majority of rendering servers are
assigned to the high resolution viewport and are load balanced among them-
selves. The minority of rendering servers doing the low and medium resolution
viewports may not have as much work to do, but because of their small num-
ber they will not greatly erode the overall efficiency of the algorithm. As
long as the workload assigned to the low and medium resolution viewports by
the progressive mesh is less than the workload of the primary high resolution
rendering servers, these few low and medium resolution renderers will always
be faster than the high resolution renderers. Thus, this imbalance will not
adversely affect the overall parallel timings.
9.5 Conclusion
The flat line timings of the foveated vision algorithm presented here
provide consistent frame rates no matter what the user viewpoint. However,
these initial results should be regarded only as the worst case timings possible
with this technique. Much faster timings could be achieved with efficient
frustum culling.
Even though enough data is stored on all the machines in the system to
render each level of detail mesh in its entirety, obviously only a small portion
of those data sets is rendered in any one frame. This is because the majority
of each data set lies outside the area of its viewport for that particular
viewing angle. To avoid rendering these polygons, it is necessary to
employ a very efficient frustum culling algorithm. The frustum culler checks
polygons against the boundaries of the viewing area and eliminates extraneous
polygons from being sent to the OpenGL rendering stream. The more efficient
the algorithm, the better the speedup the foveated vision plugin will achieve.
Assarsson [1] discusses many of the methods used in fast frustum culling.
Employing efficient culling would improve the overall frame rate of the system.
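A minimal sketch of such a cull, with illustrative types and helpers standing in for the plugin's actual names (Triangle, project(), and submitToOpenGL() are assumed to exist):

// Reject any triangle whose projected bounding box misses the viewport
// rectangle before it reaches the OpenGL stream.
#include <vector>

struct Bounds2D { float xmin, ymin, xmax, ymax; };

bool overlaps(const Bounds2D &a, const Bounds2D &b)
{
    return a.xmin <= b.xmax && a.xmax >= b.xmin &&
           a.ymin <= b.ymax && a.ymax >= b.ymin;
}

void renderCulled(const std::vector<Triangle> &tris, const Bounds2D &view)
{
    for (size_t i = 0; i < tris.size(); i++)
        if (overlaps(project(tris[i]), view))   // cheap reject test
            submitToOpenGL(tris[i]);            // assumed submit helper
}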
One issue with frustum culling is the imbalance that could exist
among the different resolution viewports. For example, only machines that
hold a particular decimated data set can render polygons to the
correspondingly sized viewport. In the example, this effectively means that the
cluster of rendering servers has been split into three groups: a high resolution
group, a medium resolution group, and a low resolution group. Because
members of these groups cannot easily shift to help relieve loading pressures, in
some instances load imbalances will result. The worst case scenario
is when the user is looking at a region containing no polygons. In that case, the
majority of the rendering servers are rendering nothing. Even so, the medium
and low resolution rendering servers can at most render all the polygons
they are assigned. Since the polygon count drops off exponentially across the
resolution levels, this count is still bounded, giving a worst case frame time.
In the case of the example presented in this chapter, that upper bound would
be 0.27 seconds.
This chapter discusses foveated vision for a single user. In order to
support multiple viewers with multiple gazes, replication is necessary. Because
of the modulo distribution of polygons among the rendering servers, a single
distributed data set can only render one viewport area. Trying to render
another viewport with that same data set would leave some polygons
unavailable. To cope with this, it is necessary to have copies of each decimated
data set (except for the lowest resolution data set, which covers the entire
display) and a set of dedicated machines for each viewer. Replication
is typically not a good attribute when dealing with large data sets,
but considering that the number of users will typically be much lower than the
number of available rendering machines, this duplication does not present an
inordinate problem for memory requirements.
Chapter 10
Conclusion and Future Work
This dissertation describes the architecture for a multiresolution mul-
tidisplay image composition system. It presents a simulator and emulator
for this architecture as a testbed. Finally, it demonstrates the usefulness of
multiresolution for achieving high interactivity in parallel multidisplay image
compositing systems by providing two applications that use multiresolution
techniques to deliver constant frame rates.
10.1 Summary
High resolution imagery and fast frame rates are a tradeoff in most
visualization applications. High resolution requires more computation time,
yielding slower frame rates. Low resolution requires less computing power and
gives a faster display, but the image quality is not as good. A primary issue
is managing the balance between high resolution and frame rate in order to
provide the best interactivity for the user (chapter 1). The thesis of this dis-
sertation is that multiresolution can manage this balance effectively, resulting
in higher levels of user interactivity than are possible with systems that do
not exploit this feature (chapter 2).
The multiresolution features of the Metabuffer support adjusting the
tradeoff between resolution and frame rate in a dynamic manner by allowing
varying levels of detail and resolution in the same image (chapter 3). Machines
in a Metabuffer equipped rendering cluster can send their output imagery to
anywhere within the entire multitiled display space in the form of a viewport.
These viewports may overlap and can be of any resolution multiple.
To demonstrate that the architecture is viable, a simulator was written
which emulates the Metabuffer at the level of the bus clock tick (chapter 4).
Running test scenes using the simulator shows that the Metabuffer architecture
is able to generate glitch free output imagery. The bandwidth requirements
for any frame are constant throughout the entire compositing process and
thus do not overload the bus and starve any of the compositing pipelines.
The simulator also shows the capabilities of the antialiasing and transparency
features of the Metabuffer.
For dealing with more complex applications of the Metabuffer, an emu-
lator is demonstrated that mimics the working of the Metabuffer but is geared
to running on its host architecture as fast as possible (chapter 5). This allows
for a level of interactivity not possible with the simulator. The emulator is
currently running on a cluster of Linux machines connected to a 5 by 2 tiled
display wall in the visualization laboratory. However, the use of cross platform
libraries for communication and display needs should allow it to be ported to
almost any platform.
To support the Metabuffer emulator, a partitioning scheme is shown
that divides models into smaller groupings of triangles which can then be sent
to the individual rendering machines on the cluster (chapter 6). The emulator
is also provided with a wireless visualization control device implemented under
Windows CE using wireless Ethernet (chapter 7). The wireless devices allow
the user to have a high level of control over the emulator.
With the emulator in place, applications can be developed for the Meta-
buffer using its simple plugin API. These applications, once developed for the
emulator, will be easy to move to the Metabuffer hardware when it is imple-
mented. Two such applications, which exploit the features of the Metabuffer,
are shown in this dissertation. The multiresolution capabilities of the Metabuffer
allow for a great deal of flexibility in allocating rendering resources to portions
of the screen. This is used in the foveated vision plugin (chapter 9) to concen-
trate the greatest level of detail and the most rendering machines where the
user is focused. The larger peripheral area is rendered in lower resolution and
detail with fewer machines.
The multiresolution features of the Metabuffer also assist greatly in
managing communication needs. This is demonstrated in the progressive im-
age composition plugin (chapter 8). Even in the face of fast changing user
viewpoints, frame rates remain steady. The rendering machines can switch
to lower resolutions instead of being forced to move polygon data or imagery
through the network. When the user stops to study an area, communica-
tion can then take place without penalty to interactivity and provide higher
resolution output.
10.2 Limitations of the Metabuffer
Because the bandwidth requirements of the Metabuffer must be even
throughout the entire rasterization of the display, there are limitations on the
sizes of the viewports that can be used. Viewports can only be of integer
resolution multiples. This is because the bandwidth needs are lessened by using pixel
replication on the composer nodes. If the resolution multiples are not integer
values, pixel replication will not be as effective and this will mean higher
bandwidth needs that in some cases may swamp the bus. Also, if a viewport
is in low resolution, it can only be positioned on a location that is a multiple
of that resolution. Again, this is to assist pixel replication.
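A minimal sketch of the replication itself, for a packed single channel buffer; the real composer operates on RGB and Z data streaming through its queue:

// Expand a viewport rendered at 1/k resolution into k x k blocks of
// display pixels. An integer k keeps the address arithmetic (and the
// bandwidth) cheap; a fractional multiple would break this scheme.
void replicate(const unsigned long *src, int srcW, int srcH,
               unsigned long *dst, int k)   // dst is (srcW*k) x (srcH*k)
{
    for (int y = 0; y < srcH * k; y++)
        for (int x = 0; x < srcW * k; x++)
            dst[y * srcW * k + x] = src[(y / k) * srcW + (x / k)];
}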
Pixel replication in general is another limitation of the Metabuffer.
While fast and simple, it yields blockiness at very low resolutions. To achieve
higher quality low resolution images some form of linear interpolation should be
used to smooth the replicated pixels. This would add greatly to the complexity
of the Metabuffer hardware, but would not be impossible.
The Metabuffer requires non-trivial custom hardware to implement.
In this regard, the SHRIMP project, which uses a standard cluster, and the
Sepia project, which uses COTS components, would be easier to deploy.
However, the Lightning-2 board developed by Intel shares the same basic ar-
chitecture as the Metabuffer. By reprogramming the compositing nodes of this
board to have an on-board cache and do pixel replication it should be possible
to yield the multiresolution features of the Metabuffer. Unfortunately, there
are very few details about the Lightning-2 available to the public.
10.3 Limitations of the Applications
The demonstration applications included in this dissertation are in-
tended to show that the multiresolution features of the Metabuffer help keep
frame rates fast and consistent. In this regard, I feel that they have served
their purpose well, but there are many improvements that could be made to
both the progressive image composition and foveated vision plugins.
For the progressive image composition plugin, a polygon moving method
needs to be fully implemented in order to allow the renderers to generate high
resolution imagery over time. A strategy for this was outlined using a server
to keep track of blocks of polygons but has not been deployed.
In the case of the foveated vision plugin, there is currently no multiuser
support. In order to support multiple users, another set of machines with a
replicated data set would have to be devoted to that user as outlined in the
conclusion of chapter 9. This should not be difficult to implement.
10.4 Future Work
There are many avenues for future work on the Metabuffer project.
These areas can be divided into work on the hardware, the applications, and
the user interface.
For the hardware, the primary goal would be to create an actual pro-
totype. One way to do this would be to obtain the design of the current
Lightning-2 board and reprogram its FPGAs to reflect the operation of the
Metabuffer. Any changes needed to the Lightning-2 layout should be very
minimal, as the two architectures share the same crossbar design. The other
way would be to do an original design with the assistance of an outside vendor.
The Metabuffer hardware would need to be a high speed design in order to
keep up with the output of the graphics cards.
For the applications, the implementation of a polygon redistribution
server is still needed for the progressive image compositing plugin. This server
would direct the rendering machines where and when to send polygon data
between one another in order to create high resolution viewports for a partic-
ular viewpoint. For the foveated vision plugin, as stated previously, additional
work needs to be done in order to support multiple users.
There is a great deal of interest from Sandia National Labs to get the
Metabuffer adapted to use the WireGL, or Chromium, API from Stanford.
WireGL is essentially an extension of OpenGL which distributes polygons
from existing applications seamlessly to a cluster of rendering computers. It
is discussed in chapter 2.
Finally, many exciting uses could be explored with the wireless visual-
ization device. Currently the device only sends data to the cluster, but there
is no reason why the cluster could not transmit information back to the device
in order to give the user more visual cues or other information. There are
certainly many more topics to be dealt with in this area.
10.5 Conclusion
The benefits of multiresolution techniques vary in usefulness. Certainly
for the cases in which image quality is paramount, multiresolution techniques
will not be a valid option. However, for situations in which user interactivity
is an overriding concern, and rendering loads are large because of data set size
or complexity, multiresolution does provide fast, consistent frame rates when
used in the context of a parallel, multidisplay image compositing system such
as the Metabuffer.
Appendix A
Simulator Classes
A.0.1 Class CClock
Public Members
CClock (int numthreads)
    In the constructor the number of threads for the barrier is specified.
~CClock ()
bool HL ()
    Enter high to low clock transition.
bool LH ()
    Enter low to high clock transition.
bool OutputReset ()
    This function initiates a system reset.
bool ReadOutputReset ()
    This function reports if a reset is taking place.

Private Members

1.1 CClock Clock Transitions
1.2 CClock Reset Line
The CClock class emulates the rising and falling edge of the hardware
clock. It uses a barrier in order to synchronize the individual threads from
each component, just as the hardware would be synchronized with the clock
signal.
CClock Clock Transitions (1.1)
Names
int mnumthreads
CBarrier* HLBarrier
CBarrier* LHBarrier
In order to simulate the rising and falling edges of the clock, two barriers
constructed with pthreads primitives are used. One controls the high to low
transition and the other controls the low to high transition. The variable
mnumthreads specifies how many threads are to be blocked at both barriers
and is set in the constructor for CClock.
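A minimal sketch of how these pieces fit together, assuming a CBarrier wrapper over the pthreads barrier primitives with a Wait() method; the real class also carries the reset line described next:

// Each component thread calls HL() and LH() once per cycle; the barriers
// release all threads together, just as a shared clock edge would.
class CClock {
public:
    CClock(int numthreads)
        : mnumthreads(numthreads),
          HLBarrier(new CBarrier(numthreads)),
          LHBarrier(new CBarrier(numthreads)) {}
    bool HL() { HLBarrier->Wait(); return true; }   // falling edge
    bool LH() { LHBarrier->Wait(); return true; }   // rising edge
private:
    int mnumthreads;
    CBarrier *HLBarrier;
    CBarrier *LHBarrier;
};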
CClock Reset Line (1.2)
Names
bool mprevoutputreset
bool moutputreset
The reset line allows the system to be initialized. It is implemented here
because CClock is accessible to all the COutFrames which bubble it up the
pipeline. This isn’t very elegant, but it eliminates the need for another class
or additional code in the component classes. Two variables are used to keep
track of it. The variable moutputreset holds the currently latched state and
mprevoutputreset holds the newly latched state.
A.0.2 Class CComposerPipe
Public Members

CComposerPipe (ulong renderer, ulong display, CInFrameBus *bus, CComposerPipe *prev, CClock *clock)
    Set up the position of this composer in the Metabuffer.
~CComposerPipe ()
bool ReadPipe (ulong *highdata, ulong *lowdata, bool *control)
    This function fetches data from the previous composer in the pipeline.
bool SetPipeReady (bool pipeready)
    This function is called by the following composer to bubble up this composer's pipeready.
bool SetPipeReset (bool pipereset)
    This function is called by the following composer to bubble up this composer's pipereset.
void DoBusIO ()
    Perform housekeeping tasks for monitoring the bus.

Private Members

long mticks
    Stores the number of clock ticks that have occurred.
CComposerQueue* mqueue
    The queue used for pixel replication.

2.1 CComposerPipe Composer Position
2.2 CComposerPipe Pipeline Readable Data
2.3 CComposerPipe Pipeline Writable Data
2.4 CComposerPipe Bus Variables
2.5 CComposerPipe Pipeline Variables
2.6 CComposerPipe Thread Functions
The CComposerPipe class simulates the composers in the pipeline. It
takes data in from the CInFrameBus and, if it is responsible for a pixel in the
display, compares that to data coming down the compositing pipeline from
previous CComposerPipes. A lot of this code implements the operations of
the pipeline. Many of the variables are in pairs to simulate the latching of
data. In the constructor, the CComposerPipe is initialized. It is given which
renderer (row) it is responsible for and which display (column) it is creating
a pipeline to drive. It is given a pointer to the renderer’s CInFrameBus class
in order to grab data off the bus as well as a pointer to the CComposerPipe
above it in order to communicate data on the pipeline. Finally, a pointer to
the global CClock is given for clock transitions.
CComposerPipe Composer Position (2.1)
Names
ulong mrenderer
ulong mdisplay
CInFrameBus* mbus
CComposerPipe* mprev
CClock* mclock
In order to communicate correctly with the other components in a Metabuffer,
it is necessary to know where this instance of the composer has been placed
and how to talk to the other components in the system. These values are
initialized in the constructor. Here the number of the renderer and display are
recorded. Pointers also exist to the previous CComposerPipe in the pipeline
and the CInFrameBus for data exchange. Finally, the CClock is included for
clock transitions.
CComposerPipe Pipeline Readable Data (2.2)
Names
ulong mhighpipe
ulong mlowpipe
bool mcontrol
This data is the latched in data owned by this instance of the CComposer-
Pipe. It consists of the mhighpipe and mlowpipe values which normally store
RGB and Z information, along with an mcontrol bit which specifies if control
information is being sent over highpipe and lowpipe instead.
CComposerPipe Pipeline Writable Data (2.3)
Names
bool mpipeready
bool mprevpipeready
bool mpipereset
bool mprevpipereset
In order to simulate a latch in of the pipeready bit as it is being bubbled up
the pipeline, two values are used. The bit mpipeready is the new value and
mprevpipeready is the currently latched in value. A similar convention is used
for mpipereset and mprevpipereset.
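A minimal sketch of this convention, assuming the copy happens on the clock transition; the helper name is hypothetical:

// Writers set the "new" members during a cycle; readers only ever see
// the previously latched members, which are updated on the clock edge.
void CComposerPipe::LatchPipelineSignals()   // hypothetical helper
{
    mprevpipeready = mpipeready;
    mprevpipereset = mpipereset;
}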
CComposerPipe Bus Variables (2.4)
Names
bool mbusready
    This value is used to tell the bus to abort a send using IRSA.
ulong mbusstate
    This value keeps track of what operation the bus is currently performing.
bool msendingviewports
    State bit to identify when viewports are being transmitted over the bus.
ulong mviewindex
    Value to keep track of viewport copying.
VIEWPORT mNewViewPort
    Data structure used to store the viewport that the composer is responsible for.
Several variables are used in communicating with the bus controlled by the
CInFrameBus instance. The bit mbusready is used by the composer to deter-
mine if an IRSA needs to be sent to the CInFrameBus. The variable mbusstate
keeps track of the current bus operation. The bit msendingviewports keeps
track of whether the bus is currently sending viewports over the bus. During
this period, the viewport that the composer is responsible for may be sent.
The variables mviewindex and mNewViewPort are used to copy the viewport
to the composer’s local memory.
CComposerPipe Pipeline Variables (2.5)
Names
int mstate
ulong mdispcoords
These variables control the operation of the pipeline in the composer. The
variable mstate is the current condition of the pipeline. It tells whether the
composer should be transmitting data, waiting for a pipeready to bubble up,
etc. The variable mdispcoords records the overall location in the display. The
composer checks against this variable to determine if its viewport is currently
within the correct range to send pixels.
CComposerPipe Thread Functions (2.6)
Names
DWORD ThreadProc ()
static void* StaticThreadProc (void *parg)
DWORD StartThread ()
To allow each class its own thread to run in, a few special calls need to be
implemented in C++. StartThread is called from the constructor and creates
the thread. In order for the system to be able to call back into the class, a
static function needs to be defined called StaticThreadProc. StaticThreadProc
takes the class instance as an argument and then calls back into ThreadProc.
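A minimal sketch of the pattern under those names, assuming pthreads and omitting error handling:

// StartThread creates the thread; the static trampoline receives "this"
// and reenters the instance, since pthreads cannot call a nonstatic
// member function directly.
DWORD CComposerPipe::StartThread()
{
    pthread_t tid;
    return pthread_create(&tid, NULL, StaticThreadProc, this);
}

void *CComposerPipe::StaticThreadProc(void *parg)
{
    ((CComposerPipe *) parg)->ThreadProc();   // call back into the class
    return NULL;
}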
A.0.3 Class CComposerQueue
Public Members

CComposerQueue ()
~CComposerQueue ()
bool Get (ulong x, ulong y, ulong *highdata, ulong *lowdata)
    This function provides pixels from the queue if x and y are in the viewport.
bool Put (ulong highdata, ulong lowdata)
    Put the data received from the bus into the queue when it belongs to the composer.
bool BufferIsPrefetched (VIEWPORT *vp)
    This function assigns the new viewport and makes sure the queue is full before starting.
void Reset ()
    Clears out the queue. Called when a reset signal bubbles up the pipeline.

Private Members

ulong* mbuffer
    This is the buffer allocated to hold the FIFO queue.
ulong mbuffstart
    The start of the queue.
ulong mbuffend
    The tail end of the queue. Note that some room is left for old data too!
long mbufflen
    The amount of data stored in the queue.
VIEWPORT mViewPort
    The current viewport that is being worked with.
VIEWPROGRESS mViewProgress
    How much progress has been made with the current viewport.
bool BufferFull ()
    If bufflen is greater than the size of buffer this is TRUE.
bool AllDataFetched (VIEWPORT *vp)
    If the buffer is full entirely with data from the current viewport this is TRUE.
The CComposerQueue class is a special version of a FIFO queue. Es-
sentially it acts like a normal queue except for one important distinction. The
data elements of the queue can be accessed (but not removed) from the queue
at any time. This allows the CComposerPipe classes to do pixel replication.
The queue buffers data coming into the CComposerPipe so that multiple data
accesses for multiresolution are not a problem. It also saves at least one line
of previous imagery so that replication can be done by accessing those old
members.
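A minimal sketch of the peekable queue idea as a ring buffer; the layout is illustrative, and the real class additionally tracks viewport progress:

// Entries can be read by position without being removed, so a composer
// can fetch the same low resolution pixel repeatedly for replication
// and still reach the previous line of imagery.
#include <vector>

class PeekQueue {
public:
    PeekQueue(size_t cap) : buf(cap), head(0), len(0) {}
    bool Put(unsigned long v) {
        if (len == buf.size()) return false;        // queue full
        buf[(head + len++) % buf.size()] = v;
        return true;
    }
    bool Peek(size_t i, unsigned long *v) const {   // read, don't remove
        if (i >= len) return false;
        *v = buf[(head + i) % buf.size()];
        return true;
    }
    void Pop() { if (len) { head = (head + 1) % buf.size(); len--; } }
private:
    std::vector<unsigned long> buf;
    size_t head, len;
};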
A.0.4 Class CInFrameBus
Public Members

CInFrameBus (int renderer, CClock *clock)
    Set up the position of this CInFrameBus in the Metabuffer.
~CInFrameBus ()
bool LoadFrame (char *szTIF, char *szZ, ulong NumViewports, VIEWPORT *ViewportArray, BOOL bShowViewport, int count)
    This function reads in new imagery from the disk.
bool LoadViewsWithoutFrame (ulong NumViewports, VIEWPORT *ViewportArray, int count)
    This function is mainly used for testing viewport locations.
bool ReadBus (ulong *highdata, ulong *lowdata, bool *control)
    Called by the composers to fetch data from the bus.
bool SetBusReady (bool busready)
    Called by the composers to pull down the busready bit.
bool SetBusReset (bool busreset)
    Called by the composers to pull down the busreset bit.
bool SetMastersSynced (bool masterssynced)
    Called by the composers to pull down the masterssynced bit.
bool GetMastersSynched ()
    If no composer has pulled it down, things are synced!

Private Members

long mticks
    Stores the number of clock ticks that have occurred.

4.1 Viewport Information
4.2 CInFrameBus Bus Variables
4.3 CInFrameBus Position
4.4 CInFrameBus Double Buffering
4.5 CInFrameBus Bus Readable Data
4.6 CInFrameBus Bus Writable Data
4.7 CInFrameBus Thread Functions
The CInFrameBus represents the graphics card sending data to the
double buffered viewport which then transmits it to the composers over the
bus. The constructor for this class specifies the renderer that it is responsible
for and also gives a pointer to the global clock for clock transitions.
Viewport Information (4.1)
Names
VIEWPORT mViewPortArray[10]
VIEWPROGRESS mViewProgressArray[10]
VIEWPROGRESS mViewProgress1
ulong mviewportindex1
ulong mnumviewports
ulong mviewportindex
These variables are used to store viewport information recorded from the en-
coding on the imagery, as well as record the progress of data sent to the
composers for each viewport. The variable mViewPortArray stores the actual
viewport for each display. The variable mViewProgressArray shows how each
viewport has been serviced. The variables mViewProgress1, mViewProgress2,
mviewportindex1, and mviewportindex2 are implemented as roll back mech-
anisms when an IRSA event occurs. A few pixels will be dropped in these
cases, so it is necessary to always store the state of the last two operations.
CInFrameBus Bus Variables (4.2)
Names
int mstate
bool mgoaheadandsend
ulong msendindex
ulong msendlength
In order to keep track of the operations of the bus, a few variables are needed
to store state. The variable mstate tells what operation the bus is currently in.
The bit mgoaheadandsend means that the frame buffer has been loaded and
swapped for the next image send. The variables msendindex and msendlength
are both used to assist in transmitting viewport structures.
CInFrameBus Position (4.3)
Names
int mrenderer
CClock* mclock
In order to communicate correctly with the other components in a Metabuffer,
it is necessary to know where this instance of the composer has been placed.
These values are initialized in the constructor. Here the number of the renderer
is recorded. The CClock is included for clock transitions.
CInFrameBus Double Buffering (4.4)
Names
unsigned char* mbuff1
unsigned char* mbuff2
unsigned char* minbuff
unsigned char* moutbuff
unsigned char* mzbuff1
unsigned char* mzbuff2
unsigned char* minzbuff
unsigned char* moutzbuff
CMutex* DoubleBuffMutex
ulong mframecount
One of the main jobs of the CInFrameBus is to double buffer the input imagery
from the graphics cards. The composers require that the input imagery be
accessed in a random fashion. Since DVI only provides the data in raster line
order, a full screen must be double buffered. These variables achieve that.
Note that Lightning-2 avoids this by rearranging the screen on the graphics
card for the proper ordering. This results in a loss of throughput, but allows
for much simpler hardware and may actually be the best way to implement
this.
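A minimal sketch of the swap under the member names above; the helper name and the Lock()/Unlock() calls on CMutex are assumptions:

// Exchange the in and out pointers under the mutex so the loader thread
// and the bus thread never see a half swapped pair of buffers.
void CInFrameBus::SwapInputBuffers()   // hypothetical helper
{
    DoubleBuffMutex->Lock();
    unsigned char *t;
    t = minbuff;  minbuff  = moutbuff;  moutbuff  = t;   // imagery
    t = minzbuff; minzbuff = moutzbuff; moutzbuff = t;   // Z data
    mframecount++;
    DoubleBuffMutex->Unlock();
}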
CInFrameBus Bus Readable Data (4.5)
Names
ulong mhighbus
ulong mlowbus
bool mcontrol
These values are placed on the bus by this instance of CInFrameBus for the
other composers on the bus to read. The variables mhighbus and mlowbus are
typically used to transmit RGB and Z information, although the mcontrol bit
can specify that control information is being passed instead.
CInFrameBus Bus Writable Data (4.6)
Names
bool mbusready
bool mprevbusready
bool mbusreset
bool mprevbusreset
bool mmasterssynced
bool mprevmasterssynced
In order to simulate a pulldown line on the bus, pairs of variables are used
for mbusready, mbusreset, and mmasterssynced. Each pair consists of a new
value as a result of a pulldown, and a currently latched value.
CInFrameBus Thread Functions (4.7)
Names
DWORD ThreadProc ()
static void* StaticThreadProc (void *parg)
DWORD StartThread ()
To allow each class its own thread to run in, a few special calls need to be im-
plemented in C++. StartThread is called from the constructor and creates the
thread. For the system to be able to call back into the class, a static function
needs to be defined called StaticThreadProc. StaticThreadProc takes the class
instance as an argument and then calls back into ThreadProc.
A.0.5 Class COutFrame
Public Members

COutFrame (int display, CComposerPipe *prev, CClock *clock)
    Initialize the position of the frame buffer.
~COutFrame ()
bool SaveImage (char *szTIF)
    This function saves the frame buffer to a TIF image.

Private Members

long mticks
    Stores the number of clock ticks that have occurred.

5.1 COutFrame Position
5.2 COutFrame Frame Buffer Variables
5.3 COutFrame Pipeline Variables
5.4 COutFrame TIF Variables
5.5 COutFrame Queue Variables
5.6 COutFrame Thread Functions
The COutFrame class simulates the output frame buffer. At the end
of each compositing pipeline, it is responsible for gathering the composited
imagery, down-sampling by averaging the 4 neighboring pixels (for supersam-
pling), and then displaying it on the tiled monitors. In the constructor COut-
Frame is given which display it is, the pointer to the CComposerPipe directly
above it (so that it can read data from the pipeline and send data back up),
and the global clock (so that it can enter into clock transitions and read the
reset line).
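A minimal sketch of the averaging step for a single channel; the real frame buffer interleaves RGB and receives pixels in raster order through the queue described below:

// Average the 2x2 block of supersampled pixels at (2x, 2y) into one
// output pixel; w is the width of the supersampled image.
unsigned char downSample(const unsigned char *img, int w, int x, int y)
{
    int sum = img[(2 * y) * w + (2 * x)]
            + img[(2 * y) * w + (2 * x + 1)]
            + img[(2 * y + 1) * w + (2 * x)]
            + img[(2 * y + 1) * w + (2 * x + 1)];
    return (unsigned char)(sum / 4);
}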
COutFrame Position (5.1)
Names
int mdisplay
    Display number as specified by constructor.
CComposerPipe* mprev
    CComposerPipe directly above frame buffer for pipeline reads and writes.
CClock* mclock
    Global clock for clock transitions and RESET line information.
In order to communicate correctly with the other components in a Metabuffer,
it is necessary to know where this instance of the COutFrame has been placed
and how to talk to the other components in the system. These values are
initialized in the constructor. Here the number of the display is recorded.
Pointers also exist to the previous CComposerPipe in the pipeline for data
exchange. Finally, the CClock is included for clock transitions.
COutFrame Frame Buffer Variables (5.2)
Names
unsigned char* mbuff
ulong mbuffindex
All of that data has to go somewhere! The buffer mbuff stores the output
image in the frame buffer. The variable mbuffindex is the current index into
the frame buffer as the data comes out in raster line order.
COutFrame Pipeline Variables (5.3)
Names
bool mpipeready
    The value of the PIPEREADY signal on the pipeline.
bool mpipereset
    The value of the PIPERESET signal the frame buffer will bubble up the pipeline.
bool mwaitforsentinel
    Keeps track of when a frame has been finished but the next hasn't started.
Several variables are used in communicating with the pipeline controlled by
the CComposerPipe instance. The bit mpipeready is bubbled up the pipeline
when the frame buffer is ready for more data. Likewise, mpipereset is bubbled
up the pipeline when a reset has been detected from the CClock instance.
The bit mwaitforsentinel keeps track of when a pipeready has been sent up
the pipeline but an acknowledgement that the composers are synced and ready
to send hasn’t been received.
COutFrame TIF Variables (5.4)
Names
CMutex* SaveMutex
int mpicindex
The libtiff library isn't thread safe, so this CMutex is used to guard against
corrupting any of its internal data structures. The variable mpicindex is the
index used to mark the name of the TIF file to save.
COutFrame Queue Variables (5.5)
Names
unsigned char* mqueue
ulong mqueueindex
In order to down-sample the four neighboring pixels into one supersampled
pixel, it is necessary to store the previous line of pixels. These variables form
a queue that always has the last line of pixels in memory.
COutFrame Thread Functions (5.6)
Names
DWORD ThreadProc ()
static void* StaticThreadProc (void *parg)
DWORD StartThread ()
To allow each class its own thread to run in, a few special calls need to be
implemented in C++. StartThread is called from the constructor and creates
the thread. For the system to be able to call back into the class, a static func-
tion needs to be defined called StaticThreadProc. StaticThreadProc takes the
class instance as an argument and then calls back into ThreadProc.
Appendix B
Emulator Distribution
B.1 Contents
There are two main tar files in this distribution:
• meta.tar.gz
• metadata.tar.gz
Meta.tar.gz contains a slightly modified version of the GLUT library,
the TIFF library, the OCview library, the Metabuffer emulator code, a plugin
directory with several example plugins for the Metabuffer emulator, and a
tools directory containing three utilities: metaload.c, which splits large data
sets into pieces using the greedy viewport allocation algorithm for the
progressive image composition plugin; metascatter.c, which divides a data set
in a modulo manner suitable for the foveated vision plugin; and metapaste.c,
which pieces the tiled display output images back together to form movies.
The initial plugin.cpp file in the Metabuffer emulator code (teapot.cpp)
does not rely on the metadata.tar.gz contents. To try either progressive.cpp
(progressive image composition), fovea.cpp (foveated vision), or the simpler
ducksetal.cpp (a couple OCview models bouncing around the display), the
metadata file is needed.
B.2 Building the Metabuffer Emulator
In order to create the Metabuffer emulator it is necessary to follow this
build process. Because some of the libraries have dependencies on the others,
build the libraries in this order.
B.2.1 glut-3.7
This is a slightly modified version of the GLUT library. The main
changes here are the addition of a call back function, glutMainLoopUpdate(),
in glut event.c to process GLUT commands in the single threaded MPICH
processes instead of resorting to the glutMainLoop() endless loop. The code
is shown below in case an updated version of GLUT needs to be modified.
void APIENTRY
glutMainLoop(void)
{
  for (;;)
    glutMainLoopUpdate();
}

/* CENTRY */
void APIENTRY
glutMainLoopUpdate(void)
{
#if !defined(_WIN32)
  if (!__glutDisplay)
    __glutFatalUsage("main loop entered with out proper initialization.");
#endif
  if (!__glutWindowListSize)
    __glutFatalUsage("main loop entered with no windows created.");
  {
    if (__glutWindowWorkList) {
      GLUTwindow *remainder, *work;

      work = __glutWindowWorkList;
      __glutWindowWorkList = NULL;
      if (work) {
        remainder = processWindowWorkList(work);
        if (remainder) {
          *beforeEnd = __glutWindowWorkList;
          __glutWindowWorkList = remainder;
        }
      }
    }
    if (__glutIdleFunc || __glutWindowWorkList) {
      idleWait();
    } else {
      if (__glutTimerList) {
        waitForSomething();
      } else {
        processEventsAndTimeouts();
      }
    }
  }
}
/* ENDCENTRY */
The function glutMainLoopUpdate() is used instead of the standard
GLUT message loop that is called at the very end of most GLUT programs.
Essentially it includes all the glutMainLoop() code except for the endless for
loop. The glutMainLoop() function remains for completeness. This addition
was done because the version of MPICH used on the Prism cluster does not
support multiple threads in a process. Therefore, the glutMainLoopUpdate()
function is called back periodically by the main thread to process the GLUT
messages.
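As a purely illustrative sketch, a single threaded emulator process might then interleave its own work with event processing as follows:

/* Illustrative only: the emulator's actual main loop is not shown here. */
for (;;) {
    do_emulator_work();      /* hypothetical per-iteration MPI/compositing work */
    glutMainLoopUpdate();    /* process pending GLUT events, then return */
}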
The makefile here is modified and configured to generate a static library
instead of the shared library that GLUT normally would create. Because of
the GLUT modification above, this version probably should not be installed
as the system version of GLUT. This way only the Metabuffer emulator will
be linked to it.
To create the GLUT library, go into the glut-3.7/lib/glut directory and
type:
make
This will generate the libglutmpi.a file that is the static library the
Metabuffer emulator will link against.
B.2.2 tiff-v3.5.5
This is an unmodified standard distribution of the TIFF image library.
It is used to save output from the Metabuffer emulator for remote debugging,
to generate movies, or to make images for papers or reports.
To create the TIFF library, go into the tiff-v3.5.5/libtiff directory and
type:
make
mv libtiff.a libtiffz.a
This will generate the libtiff.a file that is the static library the Meta-
buffer will link against. Rename this file libtiffz.a in order for the Metabuffer
makefile to work with it (for some reason the Prism machines were aliasing
this with another tiff library).
B.2.3 ocview
OCview is an out of core renderer developed at the University of Texas
at Austin. Currently Xiaoyu Zhang maintains it. OCview allows images to
be generated from data that can be much larger than the amount of memory
in the system by fetching that data from secondary storage. For most runs,
usually the data is kept small enough to just fit within system memory. Still,
this capability exists for even larger data sets when there are not enough
machines available for splitting the data set.
To create the OCview library, go into the ocview directory and type:
make
B.2.4 emu
After the previous three libraries have been built, it is now time to
build the actual emulator. First, it is necessary to tell the emulator code how
the system is laid out. Go into the emu directory and edit the enviro.h file;
a sample configuration is sketched after the list below.
• Set DISPX and DISPY to the resolution of the rendering and display
machines. At UT, this is 800 by 600.
• Set NUMOUTX and NUMOUTY to the tiling configuration of the pro-
jectors. The UT visualization lab has a 5 by 2 tiled display wall.
• Set NUMINPUTS to the number of rendering machines that are being
used. All the plugins in this distribution rely on 10 rendering machines,
though they should work with varying tile configurations. If 10 machines
aren’t available, a few changes to the plugin.cpp code might be needed.
• Set szBindings to the names (gethostname()) of the machines that drive
the display. Order these names left to right, top to bottom according to
how they are laid out in the tiled display wall.
• Set szHomeDir to the directory that contains the meta and metadata
directories. This is used for the progressive.cpp, fovea.cpp, and duckse-
tal.cpp plugins in order to find their data sets. In the distribution it is
in the ~wjb home directory.
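As a hypothetical example, an enviro.h for the UT configuration described above might read as follows; the host names and home directory are placeholders:

/* Sample enviro.h; host names and directory are placeholders. */
#define DISPX     800                 /* rendering/display resolution */
#define DISPY     600
#define NUMOUTX   5                   /* 5 by 2 tiled display wall    */
#define NUMOUTY   2
#define NUMINPUTS 10                  /* number of rendering machines */

static char *szBindings[NUMOUTX * NUMOUTY] = {
    "wall00", "wall01", "wall02", "wall03", "wall04",
    "wall10", "wall11", "wall12", "wall13", "wall14"
};
static char szHomeDir[] = "/home/wjb";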
After that is done type:
make
This will create the Metabuffer emulator (meta.exe) and link it to the
previous three libraries.
B.3 Running the Metabuffer Emulator
In order to run the Metabuffer emulator at UT, type:
mpirun.mpich -arch PROJ -np 10 \
-arch NONPROJ -np 10 \
/home/wjb/meta/emu/meta
On the UT cluster, the mpirun command is named mpirun.mpich. Oth-
ers may be different. The -arch commands specify a machine list file (on the
Prism cluster located in /usr/lib/mpich/util/machines). machines.PROJ con-
tains a list of the 10 machines hooked up to the projectors. -np 10 specifies
to use all 10 of them (obviously!). Similarly machines.NONPROJ specifies all
the machines that aren’t connected to displays, 22 in all. We need only 10 of
those. This just forces MPI to use all the projector machines and then select
from the rest, so set these things to the display wall configuration. MPI really
wants a full path to the executable, so replace /home/wjb/meta/emu/meta
with wherever the meta.exe file happens to reside.
Bibliography
[1] Assarsson, U., and Moller, T. Optimized view frustum culling
algorithms. Tech. rep., Chalmers University of Technology, March 2000.
[2] Bajaj, C. L., Pascucci, V., Rabbiolo, G., and Schikore, D. R.
Hypervolume visualization: A challenge in simplicity. In IEEE Sympo-
sium on Volume Visualization (1998), pp. 95–102.
[3] Blanke, W. Multiresolution Techniques on a Parallel Multidisplay Mul-
tiresolution Image Compositing System. PhD thesis, University of Texas
at Austin, 2001.
[4] Blanke, W., Bajaj, C., Fussell, D., and Zhang, X. The meta-
buffer: A scalable multiresolution multidisplay 3-d graphics system using
commodity rendering engines. Tr2000-16, University of Texas at Austin,
February 2000.
[5] Blanke, W., Bajaj, C., Zhang, X., and Fussell, D. A cluster
based emulator for multidisplay, multiresolution parallel image composit-
ing. Tech. rep., University of Texas at Austin, April 2001.
[6] Bunker, M., and Economy, R. Evolution of GE CIG systems. SCSD
Document (1989).
[7] Cruz-Neira, C., Sandin, D. J., and DeFanti, T. A. Virtual reality: The
design and implementation of the CAVE. Computer Graphics 27, 4 (August
1993), 135–142.
[8] Chen, Y., Clark, D., Finkelstein, A., Housel, T., and Li, K.
Automatic alignment of high resolution multi-projector displays using an
un-calibrated camera. In Proceedings of IEEE Visualization Conference
(2000), pp. 125–130.
[9] Coren, S., Ward, L., and Enns, J. Sensation & Perception. Har-
court Brace, New York, NY, 1999.
[10] Crockett, T. W. Parallel rendering. Tech. rep., ICASE, 1995.
[11] Eldridge, M., Igehy, H., and Hanrahan, P. Pomegranate: A fully
scalable graphics architecture. Computer Graphics (SIGGRAPH 2000
Proceedings) (2000), 443–454.
[12] Eyles, J., Molnar, S., Poulton, J., Greer, T., Lastra, A.,
England, N., and Westover, L. Pixelflow: The realization. In Pro-
ceedings of the Siggraph/Eurographics Workshop on Graphics Hardware
(August 1997), pp. 57–68.
[13] Ferrari, F., Nielsen, J., Questa, P., and Sandini, G. Space
variant imaging. Sensor Review 15, 2 (1995), 17–20.
[14] Fitzmaurice, G. Situated information spaces and spatially aware palm-
top computers. Communications of the ACM 36, 7 (July 1993).
[15] Foley, J., van Dam, A., Feiner, S., and Hughes, J. Computer
Graphics: Principles and Practice. Addison-Wesley Publishing Com-
pany, Reading, MA, 1990.
[16] Forrest, A. R. Antialiasing in progress. Fundamental Algorithms for
Computer Graphics 17 (1985), 113–134.
[17] Fussell, D. S., and Rathi, B. D. A vlsi-oriented architecture for
real-time raster display of shaded polygons. In Graphics Interface ’82
(May 1982).
[18] Gandhi, R., Khuller, S., and Srinivasan, A. Approximation al-
gorithms for partial covering problems. In Proceedings of ICALP 2001
(July 2001).
[19] Geisler, W., and Perry, J. Variable-resolution displays for visual
communication and simulation. The Society for Information Display 30
(1999), 420–423.
[20] Hanrahan, P. Scalable graphics using commodity graphics systems.
Views pi meeting, Stanford Computer Graphics Laboratory, Stanford Uni-
versity, May 17, 2000.
[21] Heirich, A., and Moll, L. Scalable distributed visualization using off-
the-shelf components. In Parallel Visualization and Graphics Symposium
– 1999 (San Francisco, California, October 1999), J. Ahrens, A. Chalmers,
and H.-W. Shen, Eds.
[22] Hochbaum, D. Approximation Algorithms for NP-Hard Problems. PWS
Publishing Company, Boston, MA, July 1996.
[23] Hoppe, H. Smooth view-dependent level-of-detail control and its appli-
cation to terrain rendering. In IEEE Visualization 1998 (October 1998),
pp. 35–42.
[24] Humphreys, G., and Hanrahan, P. A distributed graphics system
for large tiled displays. In Proceedings of IEEE Visualization Conference
(1999), pp. 215–223.
[25] id software. Quake. http://www.quake.com.
[26] Johnson, R. Pthreads-win32. http://sources.redhat.com/pthreads-
win32/.
[27] Kettler, K. A., Lehoczky, J. P., and Strosnider, J. K. Mod-
eling bus scheduling policies for real-time systems. In Proceedings of
16th IEEE Real-Time System Symposium (1995), IEEE Computer Soci-
ety Press, pp. 242–253.
[28] Kilgard, M. Glut. http://reality.sgi.com/opengl/glut3/.
[29] Lamming, M., Brown, P., Carter, K., Eldridge, M., Flynn, M.,
Louie, G., Robinson, P., and Sellen, A. The design of a human
memory prosthesis. The Computer Journal 37, 3 (1994).
[30] Leffler, S. Libtiff. http://www.libtiff.org/.
[31] Lombeyda, S., Moll, L., Shand, M., Breen, D., and Heirich, A.
Scalable interactive volume rendering using off-the-shelf components. In
Proceedings of IEEE 2001 Symposium on Parallel and Large-Data Visualization
and Graphics (2001), IEEE Computer Society Press, pp. 115–
121.
[32] Magillo, P., Floriani, L. D., and Puppo, E. A dimension and
application-independent library for multiresolution geometric modeling.
Tech. Rep. DISI-TR-00-11, University of Genova, Italy, 2000.
[33] Majumder, A., He, Z., Towles, H., and Welch, G. Achieving
color uniformity across multiprojector displays. In Proceedings of IEEE
Visualization Conference (2000), pp. 117–124.
[34] Mammen, A. Transparency and antialiasing algorithms implemented
with the virtual pixel maps technique. IEEE Computer Graphics and
Applications 9, 4 (July 1989), 43–55.
[35] Microsoft. Windows ce embedded visual tools. http://www.microsoft.com/
mobile/downloads/emvt30.asp.
[36] Molnar, S., Cox, M., Ellsworth, D., and Fuchs, H. A sort-
ing classification of parallel rendering. IEEE Computer Graphics and
Applications 14, 4 (July 1994).
[37] Molnar, S. E. Combining z-buffer engines for higher-speed rendering.
In Proceedings of the 1988 Eurographics Workshop on Graphics Hardware
(1988), Eurographics Seminars, pp. 171–182.
[38] Molnar, S. E. Image composition architectures for real-time image
generation. PhD dissertation, Technical Report TR91-046, University of
North Carolina, 1991.
[39] Moreland, K., Wylie, B., and Pavlakos, C. Sort-last parallel ren-
dering for viewing extremely large data sets on tile displays. In Proceed-
ings of IEEE 2001 Symposium on Parallel and Large-Data Visualization
and Graphics (2001), IEEE Computer Society Press, pp. 85–92.
[40] Muraki, S., Ogata, M., Ma, K.-L., Koshizuka, K., Kajihara,
K., Liu, X., Nagano, Y., and Shimokawa, K. Next-generation
visual supercomputing using pc clusters with volume graphics hardware
devices. In Supercomputing 2001 (2001).
[41] Pardo, F., and Martinuzzi, E. Hardware environment for a retinal
ccd visual sensor. In EU-HCM SMART Workshop: Semi-autonomous
Monitoring and Robotics Technologies (April 1994).
[42] Raskar, R., Brown, M., Yang, R., Chen, W., Welch, G., Towles,
H., Seales, B., and Fuchs, H. Multi-projector displays using cam-
era based registration. In Proceedings of IEEE Visualization Conference
(1999), pp. 161–168.
[43] Saito, N., and Beylkin, G. Multiresolution representations using
the auto-correlation functions of compactly supported wavelets. IEEE
Transactions on Signal Processing 41 (December 1993), 3584–3590.
[44] Samanta, R., Zheng, J., Funkhouser, T., Li, K., and Singh,
J. P. Load balancing for multi-projector rendering systems. In SIG-
GRAPH/Eurographics Workshop on Graphics Hardware (August 1999).
[45] Schneider, B.-O. Parallel rendering on pc workstations. In Paral-
lel and Distributed Processing Techniques and Applications (July 1998),
pp. 1281–1288.
[46] SGI. Opengl. http://www.opengl.org.
[47] Shamir, A., Pascucci, V., and Bajaj, C. Multi-resolution dynamic
meshes with arbitrary deformations. Tech. Rep. TICAM 00-07, Univer-
sity of Texas at Austin, March 2000.
[48] Shapiro, J. M. Embedded image coding using zerotrees of wavelet co-
efficients. IEEE Transactions on Signal Processing 41 (December 1993),
3445–3462.
[49] Shaw, C. D., Green, M., and Schaeffer, J. A vlsi architecture for
image composition. In Proceedings of the 1988 Eurographics Workshop
on Graphics Hardware (1988), Eurographics Seminars, pp. 183–199.
[50] Weinberg, R. Parallel processing image synthesis and anti-aliasing.
Computer Graphics 15, 3 (July 1981), 55–61.
[51] Weiser, M. Some computer science issues in ubiquitous computing.
Communications of the ACM 36, 7 (July 1993), 65–84.
[52] Wodnicki, R., Roberts, G., and Levine, M. A foveated image
sensor in standard cmos technology. In Custom Integrated Circuits Con-
ference (1995).
[53] Zhang, X., Bajaj, C., and Blanke, W. Scalable isosurface visu-
alization of massive datasets on COTS clusters. In Proceedings of IEEE
2001 Symposium on Parallel and Large-Data Visualization and Graphics
(2001), IEEE Computer Society Press, pp. 51–58.
Vita
William John Blanke was born in Charlotte, North Carolina on May
21, 1972 to Dianne Kiser Blanke and Robert John Blanke. After graduating
from Charlotte Latin School in 1990, he attended Duke University. During
this time he took summer course work from The University of North Carolina
at Charlotte and interned at the North Carolina Supercomputing Center un-
der a National Science Foundation undergraduate fellowship. He graduated
from Duke in 1994 with a Bachelor of Science in Engineering degree, triple
majoring in electrical engineering, computer science, and history. Afterwards,
he attended The University of Virginia earning a Master of Science degree in
electrical engineering in 1996. Following this, he was employed by PrivNet,
Inc., an Internet startup company which was subsequently bought by PGP,
Inc., a cryptography firm. In 1997, he attended The University of Texas at
San Antonio as a non-degree seeking student. In 1998, he enrolled in The
University of Texas at Austin as a Ph.D. student in computer engineering.
Permanent address: 2932 Houston Branch Road, Charlotte, NC 28270
This dissertation was typeset with LaTeX† by the author.

†LaTeX is a document preparation system developed by Leslie Lamport as a special version of Donald Knuth's TeX program.