Copyright
by
William John Blanke
2001
The Dissertation Committee for William John Blanke certifies that this is the approved version of the following dissertation:
Multiresolution Techniques on a Parallel Multidisplay
Multiresolution Image Compositing System
Committee:
Chandrajit Bajaj, Supervisor
Don Fussell
Vijay Garg
Margarida Jacome
Roy Jenevein
Multiresolution Techniques on a Parallel Multidisplay
Multiresolution Image Compositing System
by
William John Blanke, B.S.E., M.S.
DISSERTATION
Presented to the Faculty of the Graduate School of
The University of Texas at Austin
in Partial Fulfillment
of the Requirements
for the Degree of
DOCTOR OF PHILOSOPHY
THE UNIVERSITY OF TEXAS AT AUSTIN
December 2001
Dedicated to Vero.
Acknowledgments
Even though this dissertation lists my name as the author, many people
were instrumental in bringing it to completion. I would like to mention a few of
these names here in appreciation. However, to avoid the risk of leaving anyone
out, before doing so I would first like to thank the faculty, students, and staff
of The University of Texas in general. A dissertation involves a lot of advice,
help, mentoring, and perhaps most of all paperwork. Without the collective
assistance of The University as a whole, there would be little chance of my
research and the documentation appearing here finding its place in print.
Dr. Don Fussell and Dr. Chandrajit Bajaj started my interest in image
compositing systems. The original ideas for developing the Metabuffer can be
attributed to them. Dr. Fussell especially took an active role in fleshing out
the preliminary plans for implementing the Metabuffer. Later, Dr. Bajaj
provided an enormous amount of time and energy suggesting how to simulate
the Metabuffer and adapt it to the cluster. He also offered a great environment
to do the work. I feel privileged to have been able to use the top quality
facilities offered by the visualization lab.
I would also like to thank the other members of my committee: Dr.
Vijay Garg, Dr. Margarida Jacome, and Dr. Roy Jenevein. With the advent
of the DVI (Digital Visual Interface) standard, image composition has become
a hot research area. I would like to thank the members of my committee for
bearing with me while my research topic bent and swayed with the rapid twists
and turns of developments in this area.
My software engineering courses taught me to concentrate on how to
use available components as technologies to match with the architecture of
my designs. With the Metabuffer project, this was especially true. Wherever
possible, I employed libraries to implement portions of the system. Because
of this, I have a number of people to thank for offering their code to the
public domain free of charge. First, Sam Leffler at SGI, for his TIFF image
compression library. I am not sure how many TIFF images I generated in
running the Metabuffer simulator and emulator, but I am sure it must be
over one million. I would also like to thank the team that wrote the MPICH
implementation of MPI, and the pthreads for Windows team, whose library forms
the threading and synchronization base of the Metabuffer simulator. I would
also like to thank Mark Kilgard, whose GLUT library allowed the Metabuffer
project to move swiftly and easily from Windows, to IRIX, and finally to
Linux without incurring any user interface headaches. Finally, the OCview
library, currently maintained by Xiaoyu Zhang, a fellow CS graduate student,
performed the rendering for the Metabuffer emulator. I am indebted to him
for his personal assistance in adapting his code for my project as well as in
generating the many isosurface data sets seen throughout this dissertation.
In addition to Xiaoyu, several other CS graduate students greatly as-
sisted me in my research. James Yang was instrumental in setting up the
Prism cluster for hosting the Metabuffer. This was no small task given the
atypical custom requirements of adding high performance graphics cards to a
computing cluster. I would also like to thank Christian Sigg for his work au-
tomating much of the cluster’s processes. Even after both of these people had
departed UT, the cluster continued to function without any major issues–a
testament to the quality of their work.
None of this research could ever hope to have been completed without
some major help from the staff in Computer Sciences. Reuben Reyes especially
fielded all kinds of requests and offered any assistance I needed. I would like
to thank Patricia Baxter in TICAM and Melanie Gulick in EE for fixing my
many paperwork mistakes and dealing with my perpetual procrastinating in
all things involving form deadlines.
I consider many of my past professors at previous universities to be
some of my greatest role models. The impact these people had in my studies
influenced me to want to continue with my graduate education. I would like to
thank Dr. Stephen Jones at The University of Virginia and Dr. John Board
at Duke University. Both professors advised me during my stays at those
institutions and I can only hope to be the kind of educator that they have
become; they inspire others to want to learn.
I can never say enough thanks to The University of Texas and the
Cockrell Foundation for offering me the chance to pursue my graduate degree.
With the funding of the MCD scholarship and the Cockrell fellowship, it was
possible for me to fully commit to learning and research instead of worrying
about dollars and cents. Grants contributed by the National Science Founda-
tion also provided additional support. Their role in graduate education cannot
be overstated.
Multiresolution Techniques on a Parallel Multidisplay
Multiresolution Image Compositing System
Publication No.
William John Blanke, Ph.D.
The University of Texas at Austin, 2001
Supervisor: Chandrajit Bajaj
In most computer graphics applications, resolution is a tradeoff. Using low-
resolution images provides a low quality display, but typically allows higher
frame rates because less data needs to be computed. High-resolution images,
on the other hand, give the best display, yet are hindered by slower refresh
times and thus limit user interactivity. Low image quality and low user inter-
activity are both detriments to computer graphics visualization applications.
The question, then, is what can be done to minimize this impact.
The aim of this dissertation is to explore how to use multiresolution
in order to provide the best balance between image quality and user inter-
activity on a parallel multidisplay multiresolution image compositing system
with antialiasing called the Metabuffer. The architecture of the Metabuffer,
a simulator written in C++, and a Beowulf cluster based emulator are fully
described in this dissertation. Additional supporting hardware and software
detailed in this document include an algorithm to partition data sets into
Metabuffer viewports and a wireless visualization control device.
Using the Beowulf cluster based Metabuffer emulator, two multires-
olution techniques are studied: progressive image composition and foveated
vision. Progressive image composition allows the user to rapidly change view-
points without immediately moving data between PCs. Instead, the resolution
of each PC’s viewport adjusts in order to cover the visible polygons for which it
is responsible. The larger, low-resolution viewports have lower image quality,
but the user sees no drop in frame rate. Over time, the PCs can readjust their
data in order to shrink their viewports and provide high-resolution imagery.
Foveated vision allows computing resources to be concentrated only where the
user is actually focused. Human peripheral vision cannot discern high lev-
els of detail. Rendering the periphery with a low polygon count using a few
low-resolution viewports allows the majority of the machines to render high-
resolution viewports only where the user (or users) are looking, thus increasing
the frame rate.
Table of Contents
Acknowledgments v
Abstract ix
List of Tables xvii
List of Figures xviii
Chapter 1. Introduction 1
1.1 Motivation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1
1.3 Contributions . . . . . . . . . . . . . . . . . . . . . . . . . . . 4
Chapter 2. Background and Related Work 9
2.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.2 Sort First . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 12
2.2.1 Recent Multidisplay Systems . . . . . . . . . . . . . . . 12
2.3 Sort Middle . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
2.4 Sort Last . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
2.4.1 Recent Single Display Systems . . . . . . . . . . . . . . 18
2.4.2 Recent Multidisplay Systems . . . . . . . . . . . . . . . 20
2.5 Discussion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
Chapter 3. Metabuffer Architecture 25
3.1 Metabuffer Architecture . . . . . . . . . . . . . . . . . . . . . 25
3.2 Bus Dataflow . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.2.1 Analysis of Bus Data Flow . . . . . . . . . . . . . . . . 29
3.2.2 Buffering of Bus Data Flow . . . . . . . . . . . . . . . . 34
3.3 IRSA Round Robin Bus Scheduling . . . . . . . . . . . . . . . 35
3.4 Sequence of Metabuffer Operations . . . . . . . . . . . . . . . 36
3.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
Chapter 4. Metabuffer Simulator 40
4.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 40
4.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 41
4.3 Multiresolution Output . . . . . . . . . . . . . . . . . . . . . . 42
4.4 Antialiasing Output . . . . . . . . . . . . . . . . . . . . . . . . 44
4.5 Transparency Output . . . . . . . . . . . . . . . . . . . . . . . 46
4.5.1 Interpolated Transparency . . . . . . . . . . . . . . . . . 47
4.5.2 Multipass Methods . . . . . . . . . . . . . . . . . . . . . 48
4.5.3 Screen Door . . . . . . . . . . . . . . . . . . . . . . . . 49
4.5.4 Metabuffer Implementation . . . . . . . . . . . . . . . . 49
4.6 Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
4.7 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 53
Chapter 5. Metabuffer Emulator 54
5.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 54
5.2 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2.1 Granularity . . . . . . . . . . . . . . . . . . . . . . . . . 55
5.2.2 MPI Mapping . . . . . . . . . . . . . . . . . . . . . . . 57
5.2.3 Plugin API . . . . . . . . . . . . . . . . . . . . . . . . . 58
5.3 Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 59
5.3.1 Plugins . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.3.2 Future Work . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3.3 Undocumented Features . . . . . . . . . . . . . . . . . . 62
5.4 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 63
Chapter 6. Greedy Viewport Allocation Algorithm 64
6.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 64
6.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 65
6.2.1 Sort First Algorithms . . . . . . . . . . . . . . . . . . . 65
6.2.2 Sort Last Techniques . . . . . . . . . . . . . . . . . . . . 67
6.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 70
6.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 73
6.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 76
Chapter 7. Wireless Visualization Control Device 77
7.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 77
7.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
7.2.1 Ubiquitous Computing . . . . . . . . . . . . . . . . . . . 79
7.2.2 Augmented Reality . . . . . . . . . . . . . . . . . . . . 80
7.2.3 Context-Aware Applications . . . . . . . . . . . . . . . 81
7.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 81
7.4 Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . 86
7.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
Chapter 8. Progressive Image Composition Plugin 90
8.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
8.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
8.2.1 Progressive Transmission . . . . . . . . . . . . . . . . . 92
8.2.2 Progressive Refinement . . . . . . . . . . . . . . . . . . 93
8.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 94
8.3.1 Initial Triangle Assignment . . . . . . . . . . . . . . . . 94
8.3.2 Viewport and Resolution Determination . . . . . . . . . 95
8.3.3 Data Exchange . . . . . . . . . . . . . . . . . . . . . . . 100
8.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 101
8.4.1 Oceanographic . . . . . . . . . . . . . . . . . . . . . . . 103
8.4.2 Santa Barbara . . . . . . . . . . . . . . . . . . . . . . . 106
8.4.3 Visible Human . . . . . . . . . . . . . . . . . . . . . . . 109
8.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 112
Chapter 9. Foveated Vision Plugin 114
9.1 Introduction . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
9.2 Background . . . . . . . . . . . . . . . . . . . . . . . . . . . . 116
9.2.1 Image Processing . . . . . . . . . . . . . . . . . . . . . . 116
9.2.2 Image Transmission . . . . . . . . . . . . . . . . . . . . 117
9.2.3 Image Generation . . . . . . . . . . . . . . . . . . . . . 118
9.3 Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . 119
9.3.1 Continuous Method . . . . . . . . . . . . . . . . . . . . 120
9.3.2 Discrete Method . . . . . . . . . . . . . . . . . . . . . . 122
9.3.3 Load Balancing . . . . . . . . . . . . . . . . . . . . . . . 123
9.3.4 Compositing . . . . . . . . . . . . . . . . . . . . . . . . 127
9.3.5 Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . 128
9.4 Results . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
9.4.1 Visible Human . . . . . . . . . . . . . . . . . . . . . . . 130
9.4.2 Engine . . . . . . . . . . . . . . . . . . . . . . . . . . . 134
9.4.3 Skeleton . . . . . . . . . . . . . . . . . . . . . . . . . . . 137
9.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 141
Chapter 10. Conclusion and Future Work 143
10.1 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 143
10.2 Limitations of the Metabuffer . . . . . . . . . . . . . . . . . . 146
10.3 Limitations of the Applications . . . . . . . . . . . . . . . . . . 147
10.4 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . 148
10.5 Conclusion . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 149
Appendix 150
Appendix A. Simulator Classes 151
A.0.1 Class CClock . . . . . . . . . . . . . . . . . . . . . . . . 151
A.0.2 Class CComposerPipe . . . . . . . . . . . . . . . . . . . 153
A.0.3 Class CComposerQueue . . . . . . . . . . . . . . . . . . 159
A.0.4 Class CInFrameBus . . . . . . . . . . . . . . . . . . . . 161
A.0.5 Class COutFrame . . . . . . . . . . . . . . . . . . . . . 168
Appendix B. Emulator Distribution 174
B.1 Contents . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 174
B.2 Building the Metabuffer Emulator . . . . . . . . . . . . . . . . 175
B.2.1 glut-3.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . 175
B.2.2 tiff-v3.5.5 . . . . . . . . . . . . . . . . . . . . . . . . . . 178
B.2.3 ocview . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
B.2.4 emu . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 179
B.3 Running the Metabuffer Emulator . . . . . . . . . . . . . . . . 180
Bibliography 182
Vita 190
List of Tables
2.1 Current parallel rendering systems . . . . . . . . . . . . . . . 11
3.1 Viewport control information . . . . . . . . . . . . . . . . . . 28
3.2 Case one: bandwidth analysis . . . . . . . . . . . . . . . . . . 30
3.3 Case two: bandwidth analysis . . . . . . . . . . . . . . . . . . 31
3.4 Case three: bandwidth analysis . . . . . . . . . . . . . . . . . 33
3.5 Case four: bandwidth analysis . . . . . . . . . . . . . . . . . . 34
8.1 Progressive data set information . . . . . . . . . . . . . . . . . 102
9.1 Foveated data set information . . . . . . . . . . . . . . . . . . 129
List of Figures
2.1 SHRIMP zoom out timings for horse model . . . . . . . . . . 15
3.1 Metabuffer architecture . . . . . . . . . . . . . . . . . . . . . . 26
3.2 Case one: single screen viewport . . . . . . . . . . . . . . . . . 30
3.3 Case two: four screen viewport . . . . . . . . . . . . . . . . . 31
3.4 Case three: four screen low resolution viewport . . . . . . . . 32
3.5 Case four: nine screen low resolution viewport . . . . . . . . . 33
4.1 Simulator class instance organization . . . . . . . . . . . . . . 42
4.2 Rayshade generated input images with viewport configuration 43
4.3 Composited simulator output images . . . . . . . . . . . . . . 44
4.4 Zoomed image without (left) and with (right) antialiasing . . . 45
4.5 Screen door transparency Metabuffer output . . . . . . . . . . 50
4.6 Zoom of transparency example . . . . . . . . . . . . . . . . . . 51
5.1 Emulator class instance organization . . . . . . . . . . . . . . 56
6.1 Viewport configuration for horse example. . . . . . . . . . . . 73
6.2 Greedy algorithm timings for various model sizes . . . . . . . 75
7.1 Wireless visualization device user interface . . . . . . . . . . . 83
7.2 Wireless visualization operation . . . . . . . . . . . . . . . . . 85
8.1 Asymmetrical frustum illustration . . . . . . . . . . . . . . . . 99
8.2 Sample frames from the oceanographic movie . . . . . . . . . 104
8.3 Rendering times for oceanographic movie frames . . . . . . . . 105
8.4 Sample frames from the Santa Barbara movie . . . . . . . . . 107
8.5 Rendering times of Santa Barbara movie frames . . . . . . . . 108
8.6 Sample frames from the visible human movie . . . . . . . . . . 110
8.7 Rendering times for visible human movie frames . . . . . . . . 111
8.8 Composited visible human in visualization lab . . . . . . . . . 111
9.1 Coren’s acuity graph . . . . . . . . . . . . . . . . . . . . . . . 119
9.2 Foveated pyramid for visible human example . . . . . . . . . . 125
9.3 Sample frames from the visible human movie . . . . . . . . . . 132
9.4 Rendering times for visible human movie frames . . . . . . . . 133
9.5 Sample frames from the engine movie . . . . . . . . . . . . . . 135
9.6 Rendering times for engine movie frames . . . . . . . . . . . . 136
9.7 Sample frames from the skeleton movie . . . . . . . . . . . . . 138
9.8 Rendering times for skeleton movie frames . . . . . . . . . . . 139
Chapter 1
Introduction
1.1 Motivation
In most computer graphics applications, resolution is a tradeoff in terms
of frame rate. Using low resolution images provides a low quality display, but
typically allows higher frame rates because less data needs to be computed.
High resolution images, on the other hand, give better display quality, yet
are hindered by slower refresh times and thus limit user interactivity. Low im-
age quality and low user interactivity are both detriments to computer graphics
visualization applications. The question, then, is what can be done to minimize
this impact.
1.2 Background
Probably the most well known example of this tradeoff is the popular
computer game, Quake [25]. The Quake user faces three choices. One, he or
she can run the game in the highest resolution the computer can currently
support yielding a beautiful visual experience. Doing so, however, will likely
drop the frame rate of the game, and thus limit how well the Quake user can
interact with the environment–essentially the other Quake participants playing
concurrently in online Quake death matches (games where opponents do battle
against each other in a computer generated simulation). The reduced user
interactivity will cause the Quake user to become easy prey for murderous co-
players. Two, the Quake user can decide to use the lowest resolution possible.
The display is terrible, but the frame rate is quick and the player’s responses
are as well. The Quake user is now competitive with the rest of the players in
the death match. Three, the user can opt to upgrade his or her system to a
faster processor and video card by spending hundreds or thousands of dollars.
This will result in great graphics and quick response, though perhaps a much
lighter wallet. The choices most Quake players make are obvious. Those with
trust funds choose three. Those on work-study grants choose two.
In the field of scientific visualization, money concerns are, to an extent,
less important than results. If it were possible to improve a visualization ap-
plication by merely spending more money on a faster processor or a better
performing rendering board, it would likely be done. High priced SGI com-
puting platforms, for instance, sell in low, but profitable, quantities. In most
cases, the money spent on hardware is more than offset by the time saved and
capabilities garnered.
However, today imaging and simulations are increasingly yielding larger
and larger data streams. These data sets can range in size from gigabytes to
terabytes of information. Such data sets are much too large to store and
render on a single machine–even a pricey SGI. Viewing these large data sets
poses yet another problem. In some cases the detail allowed by a single high
performance monitor may not be adequate for the resolution required. To
cope with these issues, many systems have been designed which use parallel
computation and tiled screen displays. Dividing the data set among a number
of computers reduces its enormous bulk to more reasonably sized chunks that
can be quickly rendered. Likewise, using tiled displays results in a larger
amount of display space. Small details that might be culled out on a single
monitor can be spotted in an immersive visualization laboratory with hundreds
of square feet of screen space.
These current parallel, multidisplay systems share common problems,
however. Because they all depend on data locality in some form (di-
viding the data set evenly among the processors), changing the viewpoint of
the user can often wreck any careful load balancing done on the data set.
An unevenly load balanced data set will significantly degrade the frame rate
which a user experiences. Even worse, in some cases if the tiled displays are
linked only to certain machines, large quantities of data or pixels may need to
be moved immediately simply to render the frame correctly. This can result
in a significant delay to the user. Also, large tiled displays require immense
amounts of computing resources to render. This is despite the fact that, in
most cases, much of the display is either not in the user’s view or is only within
the user’s peripheral vision. Current parallel, multidisplay systems are limited
in how they can allocate their computing resources to cope with a partially
viewed scene in order to accelerate the possible frame rate.
The thesis of my research is that multiresolution techniques can elim-
inate data locality and resource allocation problems in parallel multidisplay
systems that render interactive large scale data streams by providing an es-
sential balance between display quality and frame rate.
1.3 Contributions
The primary contributions of this dissertation are:
1. The architecture for a parallel multidisplay multiresolution im-
age compositing system: This architecture, called the Metabuffer,
is flexible enough that the number of rendering servers can scale in-
dependently from the number of display tiles. In addition, since the
Metabuffer allows the viewports to be located anywhere within the to-
tal display space and overlap each other, it is possible to achieve a much
higher degree of load balancing. Since the viewports can vary in size, the
system supports multiresolution rendering, for instance allowing a single
machine to render a background at low resolution while other machines
render foreground objects at much higher resolution. The architecture
also supports antialiasing and transparency.
2. The Metabuffer hardware simulator written in C++: To test
the architecture of the Metabuffer, a simulator was written to mimic the
hardware in C++. The major components of the Metabuffer architecture
were coded as classes. By creating or deleting instances of the classes,
it is possible to easily test large or small Metabuffer configurations. The
simulator proves that the architecture can perform parallel, multidisplay,
multiresolution image compositing without glitches.
3. The Metabuffer emulator running on a Beowulf cluster using
MPI and GLUT: In order to test applications developed for the Meta-
buffer, an emulator was written in software that mimics the operation
of the hardware but is coded to perform as efficiently as possible on
the Beowulf cluster. While sort last systems running completely in soft-
ware are possible [39], because the approach of the Metabuffer hardware
depends on heavily parallel I/O and pipelined compositing, the limited
I/O and single processors of the individual cluster machines are not ide-
ally suited to emulating it. The large communication requirements of
so much pixel data make it difficult to map the Metabuffer architec-
ture to a standard cluster with machines that have only a single limited
bandwidth system bus. In addition, adding large numbers of machines
to a cluster to achieve pipelined computation streams causes the com-
putation granularity to be too fine relative to communication overhead.
This greatly reduces efficiency. It is for these reasons that sort last sys-
tems such as the Metabuffer usually require hardware implementations
rather than running in software. However, a workable, though not scal-
able, implementation of the Metabuffer has been created in software with
coarse parallel granularity using the MPI library to pass Metabuffer I/O
over the Beowulf cluster’s network connections and the GLUT library (a
cross-platform GUI layer for OpenGL [46] applications) to render and
display image data. A plugin API is used with this emulator testbed
to write applications which interface to the Metabuffer using only a few
standard calls; a sketch of what such an interface might look like appears
after this list.
4. A greedy algorithm for creating Metabuffer viewports to cover
the data set in order to render all polygons: In order to quickly
divide data sets into even chunks for the rendering servers to process, a
greedy algorithm was developed that uses a simple heuristic to partition
the polygons in a quick and hopefully load balanced manner.
5. Wireless visualization control device: Using Pocket PC devices
equipped with wireless Ethernet, a Windows CE client application was
written in conjunction with a Linux server to allow multiple users to re-
motely control the operations of the Metabuffer emulator plugins. Sim-
ply tapping the display of the Pocket PC device controls the orientation
of objects being viewed. The control device is also currently being used
to position the lines of sight of users for the foveated vision plugin until a
wireless gaze tracking headset is available. In the future the device may
feature region of interest (ROI) tracking in which user history, current
viewpoint, and object features are all taken into account. Collaborative
user interface ideas could also be explored when multiple devices interact
with the same display.
6. Progressive image compositing using the multiresolution capa-
bilities of the Metabuffer: A Metabuffer emulator plugin was writ-
ten to test the possibilities of using multiresolution for progressive image
compositing. If the user happens to change views of a scene, and poly-
gons local to a rendering server no longer fit within a high resolution
viewport, that viewport can enlarge and become low resolution, rather
than necessitating the shifting of polygons to other rendering servers. In
this way the user’s frame rate remains constant. When the user stops at
a scene to study it further, the polygons can be redistributed in order to
form high resolution viewports once again. This technique is analogous
to progressive refinement in the case of World Wide Web images. The
user can navigate quickly through web pages containing low resolution
images. When he or she finally arrives at the correct page, only then are
high resolution images downloaded.
7. Foveated vision using the multiresolution capabilities of the
Metabuffer: A Metabuffer emulator plugin was written to test the
possibilities of using multiresolution for foveated vision applications. The
human eye cannot discern high levels of detail in its peripheral vision.
This can be exploited by rendering the periphery using lower polygon
counts and lower resolution. Large areas of screen space can be rendered
by only a few rendering servers. Meanwhile, the majority of rendering
machines concentrate their work only where the user is actually looking.
This makes efficient use of rendering resources, especially in cases where
the display space is quite large and thus improves the user’s frame rate.
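As a concrete illustration of the plugin interface mentioned in contribution 3
above, the following is a minimal sketch of what such an interface might look
like. All names here (MetabufferPlugin, chooseViewport, renderFrame) are
hypothetical, invented for this example; the emulator's actual plugin API is
described in Chapter 5 and its distribution in Appendix B.

    // Hypothetical sketch of a Metabuffer emulator plugin interface.
    // These names are invented for illustration; the real API is
    // described in Chapter 5 and Appendix B.
    struct Viewport {
        int x, y;           // origin in the global display space (pixels)
        int width, height;  // extent; may span several display tiles
        int multiple;       // pixel replication factor (1 = full resolution)
    };

    class MetabufferPlugin {
    public:
        virtual ~MetabufferPlugin() {}
        // Decide where this machine's viewport should sit for the
        // coming frame and at what resolution.
        virtual Viewport chooseViewport() = 0;
        // Render this machine's share of the polygons into the local
        // frame buffer (via OpenGL/GLUT) for the chosen viewport.
        virtual void renderFrame(const Viewport& vp) = 0;
    };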
A chapter in this dissertation is devoted to each of these contributions.
Chapter 10 summarizes some of the limitations of this research and proposes
avenues for future work.
Chapter 2
Background and Related Work
2.1 Introduction
Today imaging and simulation applications are increasingly yielding
larger and larger data streams. Visualizing these large data streams inter-
actively may be difficult or impossible with a single computer. Because of
this, many research groups have studied the problem of visualizing data sets
in parallel. Schneider analyzes the suitability of PCs for parallel rendering of
single and multiple frames on symmetric multiprocessors and clusters [45]. In
general, most of these parallel rendering systems, with the notable exceptions
of hybrid systems such as Pomegranate [11], can be classified into three dif-
ferent categories depending on where the data is sorted from object-space to
image-space as shown by Molnar [36]. Crockett [10] describes various consid-
erations in building parallel systems and the tradeoffs associated with these
three categories.
Even with powerful parallel systems to render the data, in some cases
single high performance monitors may not have adequate resolution to resolve
the detail of large data sets. The use of multiple displays in tiled configura-
tions is an accepted way to gain very high resolution displays. Using separate
displays to display a single image, of course, has a few problems. Issues with
aligning the images of the multiple displays have been studied by both Chen [8]
and Raskar [42]. Once the images are aligned, color variations between the dis-
plays and even across the displays themselves have to be corrected. Majumder
[33] deals with the color uniformity question.
This chapter describes some of the recent systems created by others in
the parallel rendering arena and shows where the work with the Metabuffer
fits in this group. The systems are divided according to Molnar’s three sorting
categories and further subdivided by whether they work with single or multiple
displays. Section 2.2 discusses sort-first parallel rendering systems and their
tradeoffs. Section 2.3 talks about the sort-middle technique (rarely used for
cluster configurations). Section 2.4 lists the sort-last rendering systems (the
category to which the Metabuffer belongs). Each category has its benefits
and its drawbacks, and these issues are discussed in each section. Finally
section 2.5 describes the reasoning for choosing the sort last method for the
Metabuffer and why this method lends itself better to multiresolution support
than the others. Figure 2.1 is an overview of this chapter and shows each
parallel rendering system and its feature set properly classified.
System        Developer    Class       Display    Architecture
Pomegranate   Stanford     Hybrid      Single     Custom rendering hardware
WireGL        Stanford     Sort First  Multiple   Computing cluster
SHRIMP        Princeton    Sort First  Multiple   Computing cluster
PixelFlow     UNC          Sort Last   Single     Custom rendering hardware
Sepia         CalTech      Sort Last   Single     ServerNet II w/FPGA boards
Lightning-2   Intel        Sort Last   Multiple   Custom compositing hardware
Metabuffer    UT Austin    Sort Last   Multiple   Custom compositing hardware
Table 2.1: Current parallel rendering systems
2.2 Sort First
In the sort-first approach, the display space is broken into a number
of non-overlapping display regions which can vary in size and shape. Be-
cause polygons are assigned to the rendering process before geometric process-
ing, sort-first methods may suffer from load imbalance in both the geometric
processing and rasterization if polygons are not evenly distributed across the
screen partitions.
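To make the sorting step concrete, the sketch below buckets triangles into
non-overlapping screen tiles by their screen-space bounding boxes. It is a
generic illustration of the sort-first idea, not code from any of the systems
surveyed in this chapter; note how a triangle spanning a tile boundary lands
in several buckets and must be rendered once per bucket, which is exactly the
overlap penalty discussed with the systems below.

    // Generic sort-first bucketing sketch (illustrative only).
    // The display is split into non-overlapping tiles; each triangle
    // is assigned to every tile its screen-space bounding box touches,
    // so a triangle spanning a boundary is rendered more than once.
    #include <algorithm>
    #include <vector>

    struct Tri {
        float minX, minY, maxX, maxY;  // screen-space bounding box
    };

    std::vector< std::vector<Tri> > bucketTriangles(
            const std::vector<Tri>& tris,
            int tilesX, int tilesY, int tileW, int tileH) {
        std::vector< std::vector<Tri> > buckets(tilesX * tilesY);
        for (int i = 0; i < static_cast<int>(tris.size()); ++i) {
            const Tri& t = tris[i];
            int x0 = std::max(0, static_cast<int>(t.minX) / tileW);
            int y0 = std::max(0, static_cast<int>(t.minY) / tileH);
            int x1 = std::min(tilesX - 1, static_cast<int>(t.maxX) / tileW);
            int y1 = std::min(tilesY - 1, static_cast<int>(t.maxY) / tileH);
            for (int ty = y0; ty <= y1; ++ty)
                for (int tx = x0; tx <= x1; ++tx)
                    buckets[ty * tilesX + tx].push_back(t);  // duplicated if spanning
        }
        return buckets;
    }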
2.2.1 Recent Multidisplay Systems
WireGL
The WireGL software suite [24] takes an innovative approach to parallel
rendering. Essentially, it is transparent to the hosting application. WireGL
replaces the standard OpenGL dynamic link library used with Microsoft’s
operating systems. Instead of processing OpenGL commands and sending
the results to a local display as the standard OpenGL library would do, the
WireGL library sorts the OpenGL commands depending on screen location and
then transmits these commands over a high speed network to remote servers.
The servers then perform the actual rendering and show the results on their
own local display. This can effectively allow for a large multitiled display
without any modifications to the hosting application. In fact, a favorite test
application of the WireGL team is the computer game Quake, mentioned at
the start of this dissertation, which is reported to have playable interactive
frame rates when running under WireGL on a large tiled display.
Care must be taken to parse the OpenGL command stream properly.
OpenGL works like a state machine, so splitting the command stream among
several servers must ensure that commands are correctly placed to keep all the
machines in the proper mode. WireGL does this by duplicating some com-
mands, offsetting this by, interestingly enough, culling needless repetition in
the OpenGL stream. Apparently C++ programs are notorious for reinitializ-
ing OpenGL state even when not really necessary.
There are a few drawbacks to using this approach, however. Polygons
must be distributed from a central server to multiple outlying renderers. This
by itself limits the scalability and hence the usefulness of the system for ren-
dering large data sets. Like all sort first systems, WireGL suffers from load
imbalance due to nonhomogeneous polygon distribution. Also many polygons
will need to be rendered multiple times if they fall on the edges of the display
tiles. Still, WireGL is a very attractive system for transparently obtaining
large tiled displays for moderate polygon count applications.
SHRIMP
The Princeton University SHRIMP (Scalable High-performance Really
Inexpensive Multi-Processor) project [44] uses the sort-first approach to bal-
ance the load of multiple PC graphical workstations. The screen space is
partitioned into blocks that are assigned to different servers. These blocks do
not overlap–they abut. Each rendering server is responsible for the polygons
that fall within the blocks that are assigned to it. If some polygons happen to
fall into multiple blocks owned by different servers, those polygons will need
to be rendered multiple times–once by each server. The SHRIMP project at-
tempts to control communication bandwidth by assigning the blocks to the
same server that is running the display where that block resides. Otherwise,
pixels must be communicated to the correct display server from the rendering
server.
The SHRIMP project suffers from several overhead disadvantages which
are a result of its sort-first architecture. The first is the requirement of non-
overlapping blocks which necessitates rendering the polygons that do overlap
multiple times. Using smaller blocks gives better load balancing, but also
introduces severe overlap penalties. The second is the need to transmit pixels
from rendering servers to the correct display if those blocks are not already
local to the display. The current SHRIMP cluster runs with m rendering
servers on n displays, where m = n. Scaling m >> n would result in this pixel
transfer time growing enormously. Third and finally, and most troublesome
for frame rate considerations, changing user viewpoints can severely upset the
block assignment load balancing. Currently, blocks are assigned to processors
using one of three different load balancing algorithms: grid bucket assignment,
grid bucket union, and kd-split. However, all three share the same problem.
When the user moves or zooms around the scene, polygons move to different
blocks resulting in load imbalance penalties. Transmitting polygons to even
the load results in even more time used. For example, a zoomed in scene
could be evenly divided among all the rendering servers. Zooming out might
concentrate all the polygons into a single block, necessitating that they be
reorganized.
Figure 2.1: SHRIMP zoom out timings for horse model
Figure 2.1 shows the results from a SHRIMP project paper during a
zoom in operation on a horse mesh model. Because the experiment is a simple
zoom operation, polygons never have to be transmitted from one machine to
another. A polygon assigned to a certain region will always remain in that
region. The only difference is that the region grows in size. This fact spares
the example from the load imbalance and polygon transmission time penalties.
However, polygon overlap and pixel transmission still cause problems for the
SHRIMP architecture.
Even without polygon transmission penalties, from the graph it is easy
to see that user frame rates vary greatly during the operation. At the first
frame, the horse is zoomed out–probably lying in a single display on the tiled
display space. Regions of the horse are rendered by different machines in the
cluster, but pixels from these regions need to be transferred to the machine
that owns that single display. The pixel transfer overhead is clearly evident in
the graph. At the final frame, the horse has been zoomed in until it fills the
entire tiled display. Here, the polygons are much more uniformly distributed
over all the displays. Machines rendering regions of the horse most likely only
need to send their pixels to the local display.
This dissertation will demonstrate how multiresolution techniques, specif-
ically progressive image composition on the Metabuffer, effectively solve the
frame rate variation due to these problems that are evident in the SHRIMP
project, a current state of the art sort first parallel multidisplay rendering
system.
2.3 Sort Middle
In the sort-middle case, the polygon assignment is done in the middle
of the rendering pipeline–after the polygons have been processed to determine
their display coordinates and before they have been rasterized. The main
disadvantage of this technique is that almost all of the polygons need to be
retransmitted between the two steps. This amount of communication makes it
unattractive for loosely coupled parallel rendering systems involving clusters of
stand alone machines. However, this is the most common method for dedicated
hardware rendering systems. It is simple, and because these closely knit pieces
of hardware can redistribute the polygons rapidly, it is fast for low numbers
of processing units.
Because this dissertation deals with rendering extremely large data sets
on large, loosely coupled clusters, sort middle will not be discussed further in
this report.
2.4 Sort Last
The sort-last approach is also known as image composition. Each ren-
dering process performs both geometric processing and rasterization indepen-
dent of all other machines in the system. Local images rendered on the render-
ing processes are composited together to form the final image. The sort-last
method makes the load balancing problem easier since screen space constraints
are removed. However, compositing hardware is needed to combine the output
of the various processors into a single correct picture.
Such approaches have been used since the 60’s in single-display systems
[6, 17, 37, 38, 49, 50]. More recent work includes the PixelFlow [12], Sepia
[21], and AIST [40] systems. Multiple display systems, which are the focus of
this dissertation, include Lightning-2 [20] and the Metabuffer [4].
2.4.1 Recent Single Display Systems
PixelFlow
The PixelFlow [12] system developed at the University of North Car-
olina is a completely custom piece of hardware. Even the rendering engines are
custom and part of the architecture. This differs from the Sepia, Lightning-2,
and Metabuffer projects which use COTS (Commercial Off The Shelf) graphics
cards in order to render the polygons.
Essentially the PixelFlow architecture chains together rendering boards,
followed by shader boards, followed by a frame buffer board on a high speed
backplane. A parallel host computer provides graphics primitive and shading
information to each board. The boards then take this information and render
the display in 128 by 128 pixel chunks. This is done with the assistance of
a 128 by 128 SIMD processor array located on each rendering board. The
rendering boards also have other coprocessors to do geometry processing and
polygon sorting. The chunks are composited as they go down the backplane,
and then lighting and shading is performed by the shader boards until finally
the finished image is stored in the output frame buffer.
The PixelFlow system is a very powerful architecture. However, its
all-custom design might be a problem with the rapid pace of technology. Al-
though integrating the rendering engines into the architecture certainly pro-
vides a speed advantage, with the swift improvements in COTS graphics cards
this could be considered a drawback. Compositing systems such as Sepia,
Lightning-2, and the Metabuffer, which deal only with pixel output from COTS
cards, can adapt easily to newer and better COTS graphics card designs. They
only need to deal with video pixel transmission resolution standards, which
change much more slowly than COTS rendering performance. Provided the
new video card drivers support some manner of Z buffer value extraction,
simply replace the older cards with the latest and greatest. No change in cus-
tom hardware is required. Also, the PixelFlow system was not designed with
multiple displays in mind.
Sepia
One of the more recent cluster based sort-last image compositing sys-
tems is the Sepia project [21]. In a completely opposite tack to the Pix-
elFlow system, the Sepia, except for programmed FPGA chips, relies entirely
on COTS equipment and shuns custom chips and circuit boards. Sepia uses
multiple Compaq Pamette FPGA prototyping boards in conjunction with a
Beowulf cluster and a Compaq ServerNet II network. The Pamette commu-
nicates with the Beowulf cluster and the ServerNet II network using standard
PCI bus interfaces. This setup greatly leverages existing COTS technology.
The Pamette prototyping boards are configured to be pixel merge en-
gines. Pixel merge engines take input from their host PC and composite it (or
perform other mathematical operations) with data arriving from the Server-
Net II network. The output of this operation is then sent over the ServerNet
II network to another pixel merge engine on a different computer to form a
computational pipeline. When the data is finally ready to be viewed, it is sent
to a pixel merge engine which relays it to a frame buffer on its host computer
for display.
The Sepia system is intriguing because of its use of standard compo-
nents. Programmed FPGAs are really the only custom hardware needed. This
means that a system can be developed rapidly and for a relatively low cost
compared to custom hardware design. The main disadvantage of the Sepia
system is that it requires image data to be sent to and from host PCs over
the system’s PCI bus. This bus is likely to be already overloaded with data
from the rendering application and is limited by bandwidth. Also, the Sepia
system provides no way to send data from a single rendering server to multiple
pipelines. This limits its possibilities for multidisplay use. Currently the Sepia
team is exploring options to utilize the DVI (Digital Visual Interface) port on
commodity graphics cards to ship digital image data directly off the card and
avoid the PCI bus, similar to what the Metabuffer and Lightning-2 designs
employ.
2.4.2 Recent Multidisplay Systems
Lightning-2
The Lightning-2 system [20] developed by Intel and Stanford is another
recent cluster based entry into the parallel multidisplay rendering arena. It
appeared at the same time as the Metabuffer project and shares many basic ar-
chitectural features. Like the Metabuffer, it uses a bus and pipeline crossbar in
order to communicate image data and composite it to form a final display. At
each bus/pipeline connection is a large FPGA which is programmed to choose
pixels from the bus and composite them with data arriving on the pipeline.
Also like the Metabuffer, it employs the DVI port on recently made graphics
cards in order to offload pixel data from the rendering machines without load-
ing down the PCI bus or its system bus. However, unlike the Metabuffer, the
Lightning-2 method used to perform compositing does not allow multiresolu-
tion. The Lightning-2 also does not provide antialiasing support.
Metabuffer
The Metabuffer [4] hardware supports a scalable number of PCs and an
independently scalable number of displays–there is no a priori correspondence
between the number of renderers and the number of displays to be used. It also
allows any renderer to be responsible for any axis-aligned rectangular viewport
within the global display space at each frame. Such viewports can be modified
on a frame-by-frame basis, can overlap the boundaries of display tiles and
each other arbitrarily, and can vary in size up to the size of the global display
space. Thus each machine in the network is given equal access to all parts of
the display space, and the overall screen is treated as a uniform display space,
that is, as though it were driven via a single, large frame buffer, hence the
name Metabuffer.
Because the viewports can vary in size, the system supports multi-
resolution rendering, for instance allowing a single machine to render a back-
ground at low resolution while other machines render foreground objects at
much higher resolution. Also, because the Metabuffer supports supersampling,
antialiasing is possible as well as transparency using the screen door method.
2.5 Discussion
It was decided to design the Metabuffer as a sort last system because
of the inherent flexibility the method allows for load balancing. For example,
because they are sort-last systems, none of the Sepia, Lightning-2, or Meta-
buffer devices incur any of the polygon overlap penalties evident with the
SHRIMP project. Regions may overlap each other, so there is no reason to
render a polygon twice, provided the polygon is not zoomed in to be so large
as to completely exceed the bounds of a viewport. Also, there is no pixel
transmission overhead associated with the Lightning-2 and Metabuffer sort
last systems. The architectures are designed to efficiently shuttle pixels from
renderer to any display in the global display space. Compare this to SHRIMP,
where pixel transmission penalties occur whenever the local display is not used.
SHRIMP, Sepia, and Lightning-2 all do share two common problems,
though. The first is changing user viewpoints. As discussed before with
SHRIMP, changing the user’s viewpoint, either by rotating the data set, zoom-
ing it, or looking at a different area, will likely cause polygons to fall into and
out of the screen regions that the rendering machines have been assigned. In
the best case, this will simply cause a load imbalance resulting in an inefficient
use of the rendering resources. In the worst case, the machine may not be able
to cover all of the polygons it is assigned and certain polygons may not be
able to be rendered at all unless they can be transmitted to another machine
immediately. This double edged sword results in time penalties both for load
imbalance and for transmission over the network to move polygons from one
rendering machine to another.
The second problem all share is limited resource allocation flexibility.
Just like SHRIMP, if the devices are driving a very, very large display, ren-
dering that display is an all or nothing event. The entire display is rendered
in high resolution. Typically the user (or users) looking at the display may
only be studying a certain small area. The unviewed regions are wasted. Good
examples of this are CAVE [7] type virtual reality configurations. Only a small
part of the cave is viewed at any one time. Ideally, the majority of rendering
resources should be concentrated only where the users are looking. This will
improve the frame rate of the application and thus increase user responsive-
ness.
The Metabuffer attempts to solve these two issues by including mul-
tiresolution support. This allows for the progressive image composition and
foveated vision techniques that are discussed later in this dissertation. The
Metabuffer also has several other unique features not duplicated in the simi-
lar Lightning-2 architecture, namely antialiasing and transparency using the
screen door method in conjunction with pixel replication. These will be dis-
cussed in the architecture section.
Chapter 3
Metabuffer Architecture
3.1 Metabuffer Architecture
The architecture of the Metabuffer presents a number of challenges.
The most difficult problem is the large amount of data that must be processed.
Each pixel needs RGB color, Z order, and alpha information. A single frame
will have millions of pixels. A real-time rendered animation should display
approximately 30 frames per second in order to be fluid and smooth. Multiply
all of this by several rendering engines and several output displays and the
large quantities of data involved are clearly evident.
Figure 3.1 shows how a Metabuffer architecture using three rendering
engines and four output displays utilizes multiple pipelined data paths and
busses to surmount this problem. External to the board, COTS (Commer-
cial Off-the-Shelf) rendering engines (A) deliver their data to on-board frame
buffers (B) by means of the recently adopted industry standards for digital
video transmission, the Digital Visual Interface (DVI). Since COTS rendering
engines (A), at this time, transfer only 24 bits per pixel over these digital links,
color is transferred on even frames, while alpha and Z information is trans-
Figure 3.1: Metabuffer architecture (PC workstations with rendering engines
(A) feed on-board frame buffers (B), which drive a grid of compositing units
(C) connected to the displays)
ferred on odd frames. At a refresh rate of 60 hertz, this is still fast enough
to provide enough RGB, alpha and Z information for 30 frames per second.
The on-board frame buffer (B) stores information from both transmissions in
memory. Control information, such as the location of the viewports and their
final destination in the overall display, is stored on the first scan line of each
rendering engine’s image (A). This first scan line is never displayed. Instead,
DSP code, viewport data, or anything else that is needed by the control logic
of the frame buffer can be written here using standard OpenGL glDrawPixels()
calls.
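Since the DVI link carries only 24 bits per pixel, each renderer must split
its RGBAZ pixels across the two transmissions described above. The sketch
below shows one possible packing, assuming an 8-bit alpha and a 16-bit Z
value; the actual bit allocation used by the Metabuffer is not specified here.

    // Sketch of splitting RGBAZ data across two 24-bit DVI frames:
    // color on even frames, alpha plus Z on odd frames. The 8-bit
    // alpha / 16-bit Z split is an assumed allocation for illustration.
    struct Rgb24 { unsigned char c0, c1, c2; };  // one 24-bit DVI pixel

    Rgb24 packEvenFrame(unsigned char r, unsigned char g, unsigned char b) {
        Rgb24 p = { r, g, b };  // even frame carries the color channels
        return p;
    }

    Rgb24 packOddFrame(unsigned char a, unsigned short z) {
        // odd frame carries alpha in the first byte, depth in the rest
        Rgb24 p = { a,
                    static_cast<unsigned char>(z >> 8),
                    static_cast<unsigned char>(z & 0xFF) };
        return p;
    }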
When a full frame has been buffered, data is selectively sent over a
wide bus to the composer units (C) based on viewport locations. The com-
posers (C) take only the data that is required to build their column’s output
image and ignore the rest. Each composer (C) then sends its data in pipeline
fashion down the column to the next lower composer (C) so that the pixel Z
order information can be compared with those Z values from the other COTS
renderers (A). This way, only the front-most pixel is saved. The collaged data
is then stored on another on-board frame buffer. These smart frame buffers
can perform post processing on the data for anti-aliasing and are also able to
drive the off-board displays again using the DVI specification.
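The core of each compositing step is a per-pixel depth comparison. The
sketch below is a software illustration of the operation a composer applies
as pixels stream down its column; the actual operation is performed by the
compositing hardware, and the pixel layout shown is assumed for clarity.

    // Illustrative sketch of the depth comparison each composer applies:
    // keep the front-most of the pixel arriving from the pipeline above
    // and the local pixel selected from the bus for this screen position.
    struct Pixel {
        unsigned char r, g, b, a;  // color and alpha from the renderer
        float z;                   // depth value; smaller is closer
    };

    inline Pixel composite(const Pixel& fromPipeline, const Pixel& fromBus) {
        return (fromBus.z < fromPipeline.z) ? fromBus : fromPipeline;
    }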
3.2 Bus Dataflow
Encoded at the start of each rendering engine’s image is control infor-
mation that tells the input frame buffer which segments of the image should
be sent to which composers and where they should be placed in the final dis-
play. This work is done by the computer hosting the rendering engine since it
offloads the computational work to a full fledged CPU, which is more suited
to this task than the streamlined Metabuffer. The control information is sent
in tabular form, with one row corresponding to each image segment.
Dcomp   Sx   Sy   Sdx   Sdy   Dx   Dy   Dmultiple   Transparent
1       0    0    75    75    25   25   1           100
2       75   0    25    75    0    25   1           100
3       0    75   75    25    25   0    1           100
4       75   75   25    25    0    0    1           100
Table 3.1: Viewport control information
Table 3.1 shows some typical data describing a viewport configuration
(essentially the layout as described in section 3.2.1 later in this paper). Here,
the image and display size are assumed to be 100 pixels by 100 pixels. Dcomp
is the index number of the composer (or display) where the segment is to be
sent. Sx and Sy refer to the source coordinates of the segment in the rendered
image. Sdx and Sdy refer to the dimensions of the segment in the source
image. Dx and Dy refer to the destination coordinates in the display image.
Dmultiple is the replication factor of the source pixel. Since the ratio of source
to destination pixels is 1:1, this multiple is 1. Transparent refers to the special
28
patterns that are applied to pixel replication operations in order to provide
for screen door transparency. 100 means that the viewports are opaque. The
input frame buffer broadcasts the entire viewport table over the bus to the
composers at the start of each frame. Each composer then takes the entry
that it is responsible for and stores it locally.
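Because the control table rides in the first scan line of each rendered image,
the host PC can emit it with the glDrawPixels() call mentioned in section 3.1.
The sketch below packs the table into raw bytes and writes it to row zero;
the byte encoding shown (nine integer fields per entry, padded into RGB
pixels) is an assumption for illustration, since the actual control format is
internal to the Metabuffer frame buffer logic.

    // Sketch: writing the viewport control table into the first scan
    // line with glDrawPixels(). The byte encoding is assumed for
    // illustration, not the Metabuffer's actual control format.
    #include <GL/gl.h>
    #include <cstddef>
    #include <cstring>
    #include <vector>

    struct ViewportEntry {
        int dcomp, sx, sy, sdx, sdy, dx, dy, dmultiple, transparent;
    };

    void writeControlScanLine(const std::vector<ViewportEntry>& table) {
        if (table.empty()) return;
        std::vector<unsigned char> row(table.size() * sizeof(ViewportEntry));
        std::memcpy(&row[0], &table[0], row.size());
        // Pad to whole RGB pixels: three control bytes per pixel.
        GLsizei pixels = static_cast<GLsizei>((row.size() + 2) / 3);
        row.resize(static_cast<std::size_t>(pixels) * 3, 0);
        // Assumes a projection mapping window coordinates 1:1, so that
        // raster position (0, 0) addresses the first scan line.
        glRasterPos2i(0, 0);
        glDrawPixels(pixels, 1, GL_RGB, GL_UNSIGNED_BYTE, &row[0]);
    }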
3.2.1 Analysis of Bus Data Flow
One of the most interesting problems of this project is how to efficiently
transmit image data from the input frame buffers, through the bus, and then
to each composer. Since the composers are arranged in a pipeline fashion,
it is imperative that they have the data they need at the right time. If one
composer is missing its data, a glitch in the image will occur.
Since the Metabuffer employs viewports of varying size and position, it
is important to demonstrate that the bandwidth requirements of the composers
will not exceed the limited data rate of the bus that connects them to the input
frame buffers. If the bandwidth requirements are exceeded in certain viewport
configurations, glitches in the output image are certain to occur. The analysis
that follows proves that the Metabuffer has a constant bandwidth requirement
regardless of the size or orientation of the viewports that are used.
In order to analyze the worst case data flow of the board, a scheme
is used similar to the one presented in the paper by Kettler, Lehoczky, and
Strosnider [27]. Since all data needs are periodic (because of the raster display),
each task (display) can be described in terms of the amount of data needed
(C), its period (T), and its deadline (D). By quantifying these values for some
sample cases, it is easy to see that the bandwidth requirements do not change
as the viewport geometry becomes more complex.
For example, if we assume that the smallest viewport is the size of an
output screen (of w by w pixels), and that the viewports increase in size in
even multiples, observations for the following cases hold true.
Case One
Figure 3.2: Case one: single screen viewport
In figure 3.2 the input image is the same size as an output screen, but
only one composer is used. The ratio of pixels from input to output is 1:1, so
the composer requires a steady stream of data. As shown on the right, the
total bandwidth required is one screen full.
Data      Period    Deadline
C1 = w    T1 = w    D1 = w
Table 3.2: Case one: bandwidth analysis
This is the trivial case. Table 3.2 demonstrates that the data needed
(C) is equal to the period for the scheduling. A steady stream of data will
satisfy this.
Case Two
Figure 3.3: Case two: four screen viewport
Again, the input image in figure 3.3 is the same size as an output screen.
However, in this case four different composers require data. But, according to
the geometry of the display, only one composer will need data at any particular
time. As shown on the right, none of the composer viewport areas overlap.
They join together to form exactly one screen size. So, one screen size of data
is needed. The ratio of pixels from input to output is 1:1, and there is no
overlap, meaning only one pixel need be accessed on the bus at any one time.
Data            Period      Deadline
C1a = l         T1a = w     D1a = w
C1b = w − l     T1b = w     D1b = w
Table 3.3: Case two: bandwidth analysis
The variable l in table 3.3 represents the vertical dividing line in the
row between tasks 1a and 1b. For the purposes of scheduling, the horizontal
divider is ignored, since this merely changes the display destination of the
data, and not the data timing needs of the system. Adding all of the data
values together (C) results in the same quantity as the period, which means
the bandwidth is constant compared to the previous case.
Case Three
Figure 3.4: Case three: four screen low resolution viewport
In figure 3.4, the input image is four times as large in order to form
a low resolution background display. In this case four composers will require
data, but they will all require data at the same time. As shown on the right,
four screen-fulls of data are required. However, the ratio of input pixels to
output pixels is 1:4. Thus, while four times as many screens are being created,
each is furnished with one fourth of the data, so the bandwidth requirement is
still constant. The fact that four composers require pixel data at the same
time remains a complication, but since the total bandwidth requirement does not
grow, a simple buffering scheme can satisfy each of the composers.
Table 3.4 displays the results of this operation. Because pixels are being
replicated to twice their size, the period (T) of the scheduling increases by a
factor of two because there are half as many rows to process. Likewise, the
Data         Period       Deadline
C1 = w/2     T1 = 2w      D1 = 2w
C2 = w/2     T2 = 2w      D2 = 2w
C3 = w/2     T3 = 2w      D3 = 2w
C4 = w/2     T4 = 2w      D4 = 2w
Table 3.4: Case three: bandwidth analysis
data needed (C) decreases by a factor of two. If all of the C values are totaled,
the result is 2w, which is the same as the period.
Case Four
Figure 3.5: Case four: nine screen low resolution viewport
Finally, as shown in figure 3.5, the input image is again four times as
large, but now it overlaps nine composers. From the right, it can be seen that
of these nine composers, only four screens' worth of data need to be placed on
the bus at any one time. And, from the analysis of case three in table 3.4,
because the ratio of pixels is 1:4, each requires only one-fourth the bandwidth.
Again, the bandwidth requirements remain constant. Since four composers
must simultaneously have data, the bus must be buffered. Successive cases of
larger viewports and more composers can be extrapolated in a similar manner.
Data               Period        Deadline
C1a = l/2          T1a = 2w      D1a = 2w
C1b = (w − l)/2    T1b = 2w      D1b = 2w
C2 = w/2           T2 = 2w       D2 = 2w
C3a = l/2          T3a = 2w      D3a = 2w
C3b = (w − l)/2    T3b = 2w      D3b = 2w
C4 = w/2           T4 = 2w       D4 = 2w
Table 3.5: Case four: bandwidth analysis
As shown in table 3.5, because pixels are being replicated to twice their
size, the period (T) of the scheduling increases by a factor of two because there
are half as many rows to process. Likewise, the data needed (C) decreases by
a factor of two. If all of the C values are totaled, the result is 2w, which is the
same as the period.
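The constant-utilization claim in these cases can also be checked mechanically.
The following C++ sketch is purely illustrative (none of these names come from
the Metabuffer design): it sums each task's utilization C/T and verifies that
the set of periodic demands never exceeds one screen's worth of bus bandwidth.

    // Illustrative check: a set of periodic bus tasks (C, T, D) fits on the
    // bus when the total utilization sum(C/T) does not exceed 1.
    #include <cstdio>
    #include <vector>

    struct BusTask {
        double data;     // C: pixels needed per period
        double period;   // T: length of the period in bus cycles
        double deadline; // D: deadline (equal to T for these raster tasks)
    };

    bool FitsOnBus(const std::vector<BusTask>& tasks) {
        double utilization = 0.0;
        for (const BusTask& t : tasks)
            utilization += t.data / t.period;
        return utilization <= 1.0;
    }

    int main() {
        const double w = 100.0; // tile width in pixels
        // Case three: four composers, each needing w/2 pixels every 2w cycles.
        std::vector<BusTask> caseThree(4, BusTask{w / 2, 2 * w, 2 * w});
        std::printf("case three fits: %s\n",
                    FitsOnBus(caseThree) ? "yes" : "no");
        return 0;
    }

For case three, each of the four tasks contributes (w/2)/(2w) = 0.25, so the
total utilization is exactly 1.0, matching the observation above that the
summed C values equal the period.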
3.2.2 Buffering of Bus Data Flow
As stated before, supplying a local buffer on each composer is neces-
sary to allow for simultaneous access of the image data. It also provides the
capability to do multiresolution pixel replication. The buffer that each com-
poser maintains closely resembles a queue, except for one important difference.
While the buffer acts in a FIFO manner when Dmultiple is 1 (the source pix-
els and destination pixels are in a 1:1 ratio), if pixel replication needs to be
done, it is necessary to remember data from the previous row. If advanced
smoothing is being performed then multiple rows may be needed. Therefore,
the cache behaves like a queue, but also has a moving window of data that
always stores the previous source row of at least size Sdx.
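A minimal sketch of such a buffer follows; the class and method names are
hypothetical and are not taken from the simulator source.

    // Hypothetical sketch of a composer-side buffer: FIFO behavior for the
    // 1:1 case, plus a moving window retaining the previous source row so
    // that replicated rows (Dmultiple > 1) can reread it.
    #include <cstdint>
    #include <deque>
    #include <vector>

    class ComposerBuffer {
    public:
        explicit ComposerBuffer(int rowWidth) : rowWidth_(rowWidth) {}

        // Pixels arrive from the bus in FIFO order.
        void Push(uint32_t pixel) { fifo_.push_back(pixel); }

        // Consume the next pixel (assumes the FIFO is not empty), remembering
        // it in the previous-row window for later replication.
        uint32_t Pop() {
            uint32_t p = fifo_.front();
            fifo_.pop_front();
            window_.push_back(p);
            if (window_.size() > static_cast<size_t>(rowWidth_))
                window_.erase(window_.begin());
            return p;
        }

        // Reread a pixel of the retained previous row (0 = leftmost).
        uint32_t FromPreviousRow(int x) const { return window_[x]; }

    private:
        int rowWidth_;                 // at least Sdx pixels are retained
        std::deque<uint32_t> fifo_;    // plain queue behavior for the 1:1 case
        std::vector<uint32_t> window_; // moving window over the previous row
    };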
3.3 IRSA Round Robin Bus Scheduling
In order to send data to the composers in a simple yet efficient manner, an
idle recovery slot allocation (IRSA) round robin approach [27] is employed,
which distributes data to the composers evenly based on the amount of data
needed (C), the period (T), and the deadline (D). No effort is made to look
ahead in the geometry of the viewports to find the most efficient way to send
the data out. However, because the previous analysis showed the data demands
to be uniform, this simple method transmits to each buffer with few delays.
In the event that a composer-side buffer becomes too full to cope with
the data, the round robin scheduler performs an idle slot recovery operation.
The composer receiving data drops a bit defined as BUSREADY low on the bus
for one clock cycle. Once the input frame buffer reads the low BUSREADY
bit, it stops sending data to that composer and jumps to the next scheduled
segment in the table. This way other composers can utilize the unused time
on the bus. The scope of the BUSREADY bit will be limited by the fanout of
the bus, but this is true of the bus in general, and the low number of displays
typically used should not cause a problem here.
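The following sketch illustrates the idle slot recovery step just described;
the data structures are illustrative only, standing in for the broadcast
viewport table discussed earlier.

    // Illustrative round robin step with idle slot recovery: walk the segment
    // table in order, skipping composers that have pulled BUSREADY low.
    #include <vector>

    struct Segment {
        int composer;  // Dcomp: destination composer for this segment
        bool busReady; // false if that composer's buffer is currently full
    };

    // Returns the index of the next segment to service after 'last', or -1
    // if every composer has stalled (the bus idles this cycle).
    int NextSegment(const std::vector<Segment>& table, int last) {
        int n = static_cast<int>(table.size());
        for (int step = 1; step <= n; ++step) {
            int i = (last + step) % n;
            if (table[i].busReady)
                return i;  // this composer can accept data
            // BUSREADY low: its slot is recovered and offered to the next entry
        }
        return -1;
    }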
3.4 Sequence of Metabuffer Operations
For each frame, the Metabuffer follows a sequence of steps in order to
compute the final collaged output display. In order to synchronize themselves,
the pipeline composers and output frame buffer employ a PIPEREADY bit to
communicate with each other. The details of this method follow below:
1. Frame Transition: Input frame buffers finish the previous frame, switch
to next frame, and start feeding data to the composers.
2. Waiting for PIPEREADY: At this stage, composers have not re-
ceived a PIPEREADY bit bubbling up from the composers in the pipeline
below, but accept data until their internal buffers are entirely full with-
out transmitting any data for this frame (though the previous frame
could still be in computation) down the pipe.
3. Buffers Are Filled: When the internal buffers of the composers become
full, each drops the BUSREADY bit on each transmission request from
the input frame buffers, effectively stalling the Metabuffer.
4. Output Frame buffers Signal Completion: When the output frame
buffers realize that they have finished building the old frame, they switch
to a new frame and send a high PIPEREADY bit to the previous com-
poser.
5. Composer Relays Finish Signal: When a composer gets a PIPEREADY
bit from the following composer (or output frame buffer), it checks to
see if its internal buffer is fully prefetched solely with the data from the
new frame (all data from the old frame has been cleared out). If so, it
relays the PIPEREADY bit to the previous composer in the pipeline. If
not, it stalls until it is entirely prefetched.
6. Master Composer Signals Start of Frame: Once the PIPEREADY
bit gets to the master composer (the composer at the top of the pipeline),
and the master composer is ready, everything is set for that pipeline
to begin computation of the next frame. The master composer starts
the frame by sending a STARTFRAME bit down the pipeline and then
streaming out data.
7. Composers in Pipe Begin Frame: The other composers in the
pipeline, once they read the STARTFRAME bit, relay that bit down
the pipeline and begin their computation. The STARTFRAME bit is
important because it automatically establishes each composer’s position
on the pipeline (since each successive composer must be offset one cycle
to be synchronized). Only the head composer at the top of the pipeline
needs to be initialized with a PIPEMASTER bit, set via a jumper when
the circuit board is installed.
8. Input Frame buffer Streams Out Data: Now that the pipeline
is started and data is flowing, the input frame buffer will no longer
get BUSREADY low bits, and can resume streaming data out to the
computing composers in a round robin fashion.
Now that data is flowing through the busses and the pipeline, each composer,
using an internal index of the output display, determines whether the segment
it is responsible for intersects the current coordinates. If so, it attempts
to fetch the proper pixel information from the cache and compares it to the
Z value of the previous pixel in the pipeline. Once an entire display has been
sent to the output frame buffer, the process repeats itself.
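A hedged sketch of this per-pixel step is given below; the structures are
illustrative rather than the actual hardware design, and the sketch assumes
smaller Z values are nearer to the viewer.

    // Illustrative per-pixel composer step (not the actual hardware design).
    #include <cstdint>

    struct Pixel { uint32_t rgb; uint32_t z; };

    struct Viewport { int dx, dy, w, h; }; // destination rectangle on display

    inline bool Covers(const Viewport& v, int x, int y) {
        return x >= v.dx && x < v.dx + v.w && y >= v.dy && y < v.dy + v.h;
    }

    // Merge this composer's cached pixel with the pixel arriving from the
    // previous stage of the pipeline; the nearer pixel wins.
    Pixel Composite(const Viewport& v, int x, int y,
                    Pixel fromPipe, Pixel fromCache) {
        if (Covers(v, x, y) && fromCache.z < fromPipe.z)
            return fromCache;
        return fromPipe;
    }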
3.5 Conclusion
The Metabuffer provides for leveraging today’s commodity PC technol-
ogy to construct cost-effective, parallel high-end graphics rendering systems
with multidisplay capability. It has the advantages of easing load balancing
by providing a uniform display space abstraction to the software, supporting
multiresolution and foveated display, and providing a scalable platform with
no changes to stock hardware. It does require the development of non-trivial
custom hardware to perform image compositing. However, a parallel effort
at Stanford University has been able to design hardware that can support a
version of this type of image compositing [20]. Fortunately, most of this work
can be done without resorting to custom VLSI, at least for prototypes.
The Metabuffer can also hope to avoid the fate of so many parallel
architecture projects in the past, in which the development of custom switch-
ing hardware took so long that the advantages of parallel computation were
swamped by the rapid development of commodity semiconductor technology.
It achieves this not only by avoiding custom silicon, but also because the
hardware is designed around video standards, which change more slowly than
processor and system clock speeds. A Metabuffer system will therefore be
usable with many future generations of processors, even with a slower
development cycle.
Chapter 4
Metabuffer Simulator
4.1 Introduction
Because of the complexity of the Metabuffer, a prototype has been built in
software. This prototype models as closely as possible the operation of the
Metabuffer architecture discussed previously in this paper. Since this
software prototype will be the basis for the first hardware implementation of
the Metabuffer, all coding was done strictly with the Metabuffer architecture
in mind.
By building the prototype in software first, it is possible to do much more
extensive testing and to try many more design alternatives in the same amount
of time than would be possible with hardware. Changing a signal or reworking
an algorithm means only recompiling the source code, instead of rewiring a
circuit board or burning another FPGA. Also, with a software prototype, a
Metabuffer consisting of hundreds or thousands of rendering engines can be
simulated. Building a prototype Metabuffer of that size in hardware would
require an enormous amount of resources.
Although the software prototype cannot operate in real time, it can be
used to thoroughly simulate the operations of the Metabuffer. Just about any
aspect of the design can be programmed and evaluated. New algorithms can
be tested on the prototype just as if they were encoded into a DSP. Likewise,
applications that use the Metabuffer can be tested at an early stage with the
software prototype to resolve design issues, bearing in mind that the final
hardware version of the Metabuffer will offer more performance while
operating identically.
4.2 Implementation
The Metabuffer software prototype was completed in C++ since the
highly modular design concept lends itself to the use of object oriented pro-
gramming. Each module (input frame buffer, composer, and output frame
buffer) is defined as a separate C++ class. The data hiding capabilities of
object oriented programming mean that it is possible to create a large Meta-
buffer with possibly thousands of composers simply by replicating one class
over and over again. Also, once the class is defined, changing the layout of the
Metabuffer simply means adjusting the number of frame buffers and composers
being used via the creation or deletion of class instances.
Each class used in the Metabuffer simulator runs in its own pthread. All
the classes are synchronized by a global clock. In hardware, this clock would
be a signal on the bus. In software, the high to low and low to high clock
transitions are implemented by a barrier written using pthread primitives.
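A minimal sketch of such a barrier, written in the spirit of the CClock class
but not taken from the dissertation's actual code, might look as follows:

    // Sketch of a clock barrier built from pthread primitives: the last
    // thread to arrive releases the whole group, emulating a clock edge.
    #include <pthread.h>

    class Barrier {
    public:
        explicit Barrier(int count)
            : count_(count), waiting_(0), generation_(0) {
            pthread_mutex_init(&mutex_, 0);
            pthread_cond_init(&cond_, 0);
        }
        // Each simulated component calls Wait() at every clock transition.
        void Wait() {
            pthread_mutex_lock(&mutex_);
            int gen = generation_;
            if (++waiting_ == count_) {
                waiting_ = 0;
                ++generation_;                     // open the next clock phase
                pthread_cond_broadcast(&cond_);    // release all waiters
            } else {
                while (gen == generation_)         // guard against spurious wakeups
                    pthread_cond_wait(&cond_, &mutex_);
            }
            pthread_mutex_unlock(&mutex_);
        }
    private:
        pthread_mutex_t mutex_;
        pthread_cond_t cond_;
        int count_, waiting_, generation_;
    };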
Figure 4.1: Simulator class instance organization
These barrier calls are placed in a separate class called CClock, which is
referenced by all the other components in the system. A diagram showing both
the layout and the dependencies of the class instances for a Metabuffer
simulator consisting of two renderers and two displays is shown in figure 4.1.
Each of the classes shown is fully documented in Appendix A.
4.3 Multiresolution Output
In order to test multiresolution support of the software prototype of
the Metabuffer, it was necessary to obtain a source of rendered images and Z
order values. Eventually this data will come from the digital output of COTS
rendering engines. For these particular tests, images and Z order values were
generated using the Rayshade ray tracer. Reading an image in TIF format
and the Rayshade generated Z order information into the input frame buffer
class simulates the transmission of a frame of RGB data and a frame of Z order
data from the rendering engine.
Figure 4.2: Rayshade generated input images with viewport configuration
Figure 4.2 shows the TIF images that were rendered using Rayshade: a ball, a
tube, and finally a seascape. The final diagram illustrates how these images
were distributed to the four output displays by being broken up into
viewports. Note that every image is sent to at least two output displays. As
discussed earlier in this paper, the location and geometry of the viewports
are arbitrary. The bandwidth requirements over the bus remain constant.
Running the three images through a Metabuffer configured with three input
frame buffers and four output frame buffers yields the four output screens in
figure 4.3. Note that
Figure 4.3: Composited simulator output images
the tube resides in four separate displays, despite being rendered on a single
machine. Also, see how the seascape here is being used as a low resolution
background display with the higher resolution foreground images layered on
top. Finally, the Z order of the input images is always taken into account,
whether that means that the ball is in front of the tube, or that the ocean
surface laps at the base of the foreground objects.
4.4 Antialiasing Output
One problem with compositing separate images like the ones above is
the aliasing that results on the edges. A solution that has been implemented
involves supersampling. Simply increasing the detail of the input images and
then having the output frame buffers average the pixel values down to the
original size effectively smooths the image. Only the problem pixels at the
edges are affected. The rest of the composited image pixels remain as sharp
as on the original.
This technique is commonly used in graphics cards to antialias displays.
It is extremely simple, since the only major change to the graphics pipeline,
besides the increase in resolution, is an averaging step at the very end. The
main disadvantage is the fact that the graphics hardware has to run so much
faster in order to generate the extra pixels. This is not much of an issue inside
the tightly coupled hardware of a graphics card. In a more loosely coupled
system like a cluster these heightened bandwidth requirements could be a
problem. But, even with the bandwidth concerns, supersampling has been
implemented in PixelFlow, another sort last system similar to the Metabuffer.
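As an illustration of the averaging step, the sketch below assumes 4x
supersampling, pixels packed as 0x00RRGGBB, and hypothetical names; it reduces
each 2x2 block of supersampled pixels to one output pixel.

    // Illustrative averaging step for 4x supersampling: each output pixel is
    // the mean of a 2x2 block. Any alpha bits are dropped.
    #include <cstdint>
    #include <vector>

    // in: (2w) x (2h) image, row-major; out: w x h image.
    void Downsample2x2(const std::vector<uint32_t>& in, int w, int h,
                       std::vector<uint32_t>& out) {
        out.resize(static_cast<size_t>(w) * h);
        for (int y = 0; y < h; ++y) {
            for (int x = 0; x < w; ++x) {
                uint32_t r = 0, g = 0, b = 0;
                for (int dy = 0; dy < 2; ++dy) {
                    for (int dx = 0; dx < 2; ++dx) {
                        uint32_t p =
                            in[static_cast<size_t>(2 * y + dy) * (2 * w) +
                               (2 * x + dx)];
                        r += (p >> 16) & 0xFF;
                        g += (p >> 8) & 0xFF;
                        b += p & 0xFF;
                    }
                }
                out[static_cast<size_t>(y) * w + x] =
                    ((r / 4) << 16) | ((g / 4) << 8) | (b / 4);
            }
        }
    }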
Figure 4.4: Zoomed image without (left) and with (right) antialiasing
The two images generated by the Metabuffer in figure 4.4 (magnified
eight times to show the difference in detail) demonstrate the effect supersam-
pling has on the resulting image quality. On the left, no supersampling has
been performed. There is a jagged transition between the different input im-
ages at the Z buffer transition. On the right, the input images were rendered
to be four times as detailed and the final output pixels were averaged by the
output frame buffer from the four nearest pixels that traveled through the
composer pipeline. The jagged transition is now much smoother while the rest
of the image has lost no quality.
4.5 Transparency Output
A major issue for sort last parallel rendering systems is transparency.
In sort first systems, a region in the display space is assigned to a single com-
puter. That machine can easily make the calculations necessary to create
transparency in that single area. With sort last systems, though, many machines
may be contributing polygons to form a single region in the display space.
Some of those polygons could be opaque and some could be transparent. Poly-
gons could be of varying depth on different machines resulting in interleaving.
Also, polygons are seldom sorted back to front in the compositing chain. This
dissertation discusses three different methods used to create transparency on
sort last systems: interpolated transparency, multipass, and screen door. It
includes the reasoning for using the screen door implementation on the Meta-
buffer and gives examples of its output.
4.5.1 Interpolated Transparency
Interpolated transparency is represented by equation 4.1, as stated
by Foley [15].
Iλ = (1 − kt1) Iλ1 + kt1 Iλ2    (4.1)
The transmission coefficient kt1 measures the transparency of the poly-
gon in the foreground. The final pixel color is achieved by using this coefficient
to linearly interpolate the color contribution of the polygon in the background,
Iλ2, with the color of the transparent polygon in the foreground, Iλ1.
The primary problem with interpolated transparency as it relates to sort last
systems is that it is not commutative. For the technique to work properly,
polygons must be correctly sorted from back to front. Typically sort last
systems allow interleaving of polygons belonging to multiple machines. This
interleaving information is often lost by the time the viewport of the machine
is rendered. Only the topmost polygons and Z values remain. Therefore, strict
rules regarding the grouping of polygons must be followed for it to work on a
sort last system. These restrictions destroy much of the flexibility sort
last systems offer for load balancing.
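The order dependence is easy to demonstrate numerically. The short sketch
below applies one color channel of equation 4.1 with kt1 = 0.3 and shows that
blending white over black differs from blending black over white:

    // One color channel of equation 4.1: foreground i1 with transmission
    // coefficient kt1 over background i2.
    #include <cstdio>

    double Blend(double i1, double kt1, double i2) {
        return (1.0 - kt1) * i1 + kt1 * i2;
    }

    int main() {
        const double kt = 0.3;
        // Swapping the blend order changes the result, so the operation
        // is not commutative.
        std::printf("white over black: %.2f\n", Blend(1.0, kt, 0.0)); // 0.70
        std::printf("black over white: %.2f\n", Blend(0.0, kt, 1.0)); // 0.30
        return 0;
    }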
With the Metabuffer system, another concern is pipeline ordering. Poly-
gons need to be sorted from back to front. That means that distant polygons
must be at the head of the pipeline and the closest polygons should be at the
tail. If the user were to rotate the data set 180 degrees, almost the entire data
set would need to be reshuffled to comply with the sorting assertion.
The Sepia system, however, is excellently equipped to deal with these
issues. As mentioned previously, Sepia uses ServerNet II to form its composit-
ing pipeline. ServerNet II has the advantage that it can be reconfigured on the
fly to change the routes that packets take within the system. A compositing
pipeline can be reordered upside down simply by changing the ServerNet II
routes.
This is the method employed to render volumes on the Sepia system
[31]. A cubed data set is subdivided into 8 pieces. These pieces are rendered
separately and then blended together using the Sepia system. Depending on
the user’s viewpoint, the ServerNet II network adapts to put the pieces in the
correct back to front ordering. Because the pieces do not overlap and have no
interleaved polygons, changing the compositing routes is sufficient to satisfy
back to front sorting. A similar method is employed by Muraki on an image
compositing system using a prioritized binary tree method [40].
4.5.2 Multipass Methods
Mammen [34] describes a method to render transparency in multiple passes. His
technique removes the need to sort the polygons from back to front, but does
introduce more complexity by requiring multiple steps. After all of the opaque
polygons are rendered to a Z buffer, the algorithm goes through an iterative
process to determine which of the transparent polygons is furthest back but
still visible. The transparent effects of that polygon are contributed to the
rendering, and the process is repeated until all of the transparent polygons
have been taken into account.
Multipass transparency is slow and complex, but yields excellent results.
PixelFlow uses this technique, but employs a special library to isolate the
programmer from the difficulties of implementing the operation.
4.5.3 Screen Door
Just as the name implies, with the screen door method of transparency,
instead of treating polygons as transparent they are simply rendered with a
portion of their pixels dropped to allow the background to show through. The
more pixels dropped, the more transparent the polygon appears. Because the
screen door effect is fully recorded by the Z buffer, this technique is
neither dependent on compositing pipeline ordering nor on polygon sorting
order, making it ideal for sort last architectures.
4.5.4 Metabuffer Implementation
Screen door was chosen for the Metabuffer primarily because of the
flexibility it gives regarding the ordering of the compositing pipeline. Unlike
Sepia, with its configurable ServerNet II network, the Metabuffer’s pipeline
is fixed in hardware. But since the screen door algorithm requires no poly-
gon sorting, changing user viewpoints will not require shuffling the data set
and thus will not adversely affect the frame rate. Another advantage is that
the Metabuffer system already uses pixel replication for multiresolution and
employs supersampling for antialiasing. This abundance of redundant pixels
makes it quite easy to create screen door masks without affecting the quality
of the image. For instance, on non-supersampled viewports, each pixel is repli-
cated four times and then averaged down to one pixel on the final display. By
employing a simple checkerboard mask on the replicated pixels, the averaged
output pixel correctly achieves a 50% transmission coefficient. An example
using this method is shown in figure 4.5.
Figure 4.5: Screen door transparency Metabuffer output
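A sketch of the mask decision follows; the subpixel layout is an assumption
for illustration. A checkerboard over the 2x2 replicas drops exactly half of
them, so the averaging step yields the 50% transmission coefficient described
above.

    // Illustrative mask decision for 50% screen door transparency on 2x2
    // replicated pixels: dropped replicas let the background's Z values win,
    // and the later averaging step produces the blended appearance.
    inline bool DropForScreenDoor(int x, int y) {
        return ((x ^ y) & 1) != 0; // checkerboard over the replica grid
    }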
The screen door technique is not without its problems. Because the
Metabuffer only employs 4x supersampling, transparency can only be quan-
tized into four levels. Also, if multiple transparent layers of polygons overlap,
the screen door patterns may interfere with each other creating undesirable
effects. Figure 4.6 is a zoom of figure 4.5 showing how the ball completely
obscures the tube behind it as a result of these mask collisions. In addition,
performing the screen door mask on replicated pixels will produce problems if
polygons from different machines interleave, since only the front-most Z val-
ues for each machine’s viewport are recorded. However, if these limitations are
taken into account, screen door is an adequate way to achieve transparency.
Figure 4.6: Zoom of transparency example
4.6 Distribution
The Metabuffer simulator included in the distribution has been tested
and run primarily on Windows NT. However, it has been ported to IRIX and
should run on any system that has a pthreads compliant library installed.
The Metabuffer simulator distribution consists of three main parts.
The first is the actual source code for the component classes. Included here
is code for supporting classes that form wrappers around the synchronization
primitives. This helps to make the code more cross platform if another thread
library is used instead of pthreads.
Also included in the distribution is a Windows implementation of the
pthreads library [26]. Windows has its own threading model and does not
implement the pthreads standard. Normally, this would be fine and changing
the synchronization classes to Windows functions would port the code. However,
the clock emulation relies on a barrier class built from condition variables,
which Windows does not support natively. The pthreads library included here
for Windows implements condition variables. This is not a trivial task and
actually requires timeout parameters to prevent deadlock. By using the
pthreads library instead of the native Windows threading model, barriers can
be correctly implemented.
Finally, a version of the libtiff library [30] is included in the distribution
for reading and writing images. Source images generated by Rayshade are read
into the Metabuffer simulator as TIFs. Likewise, output images generated by
the Metabuffer simulator COutFrame classes are written as TIF files.
4.7 Conclusion
The Metabuffer simulator provides a valuable testbed for evaluating image
compositing ideas at the granularity of the bus clock. Running test images
through the simulator in numerous different viewport combinations shows that
the Metabuffer can generate glitch-free output images, and thus that
bandwidth requirements are constant no matter what the viewport arrangement
is for the scene.
Chapter 5
Metabuffer Emulator
5.1 Introduction
In order to provide an interactive testbed for writing applications for
the Metabuffer system, an emulator was written in software that would mimic
the operations of the Metabuffer while attempting to run as fast as possible.
The Metabuffer emulator essentially produces the same output as the hard-
ware level simulator, except it is not constrained to work as the Metabuffer
hardware would. Thus it can be optimized to run as fast as possible on the
host architecture.
The host system for the Metabuffer emulator is a Beowulf cluster con-
sisting of 128 networked Compaq computers running the Linux OS. Each ma-
chine contains an 800 MHz Intel Pentium III 256K L2 processor and 256 MB
RDRAM. 32 of these machines are equipped with high performance Hercules
3D Prophet II GTS 64 megabyte DVI graphics cards. Furthermore, 10 of these
graphics cards are linked to a 5 by 2 tiled projection screen display in the UT
visualization lab.
The Metabuffer emulator uses MPI for communication on the cluster
and a slightly modified version of the GLUT [28] library for doing all graphics
rendering and display. Instead of sending image data out of the DVI port to
the Metabuffer hardware, the Metabuffer emulator reads back the pixel infor-
mation from graphics cards belonging to the Beowulf machines using OpenGL
glReadPixels() calls to the GLUT window. This image data is then sent over
the network (instead of through the Metabuffer I/O lines) via MPI and com-
posited by other machines (instead of using the Metabuffer pipeline) in the
Beowulf cluster. These compositing machines also display the final images on
the projection screen display again using the GLUT library.
5.2 Implementation
5.2.1 Granularity
The primary reason that the Metabuffer emulator is faster than the
Metabuffer simulator is granularity. The Metabuffer emulator uses the MPICH
library for communicating data between the machines in the Beowulf cluster.
In the case of the simulator, each component is synchronized with the other
via a global bus clock. No matter how the workload is divided, the machines
doing the processing still have to synchronize themselves to this clock. As a
result of this fine level of granularity, millions of synchronizations are
needed, one for each pixel. For example, a version of the Metabuffer simulator
ported to use MPI required five minutes to complete a single frame.
The Metabuffer emulator, on the other hand, performs all of the work
at the granularity level of the frame. It disposes of the CComposerPipe code
used to process the pipeline pixel by pixel and instead sends whole buffers
of image data directly from the CInFrameBus renderers to the COutFrame
machines which now are responsible for both compositing and displaying the
output.
Figure 5.1: Emulator class instance organization
Figure 5.1 shows the class instance dependencies for a Metabuffer em-
ulator consisting of two renderers and two displays. Each renderer sends one
message to every display in the system. This message contains whatever image
fragment it is contributing to that display. The displays receive all the image
fragments and piece them back together again to form the final image.
Looking at the cross hatching of messages from CInFrameBus renderers
to COutFrame displays immediately reveals that the Metabuffer architecture
does not map well to a common PC cluster. Each renderer must communicate
with all the displays in the system, which can result in very high communica-
tion requirements. Even more problematic, if several rendering machines send
all their data to one display machine, that display machine will be severely
overloaded with compositing duties. As a result, the Metabuffer emulator
running on the Beowulf cluster is not scalable to a large number of machines.
The Metabuffer hardware solves all these issues by using high band-
width parallel I/O and compositing pipelines consisting of many compositing
processors. COTS PCs connected in a cluster exhibit none of these quali-
ties. The bandwidth of the communications network is limited. Also, though
each machine does have a very powerful processor on board, it cannot match
the efficiency of multiple smaller processing blocks in a pipeline arrangement.
Still, even though limited in the number of machines that can be used, the
Metabuffer emulator does achieve interactive frame rates for exploring new
applications for the Metabuffer hardware.
5.2.2 MPI Mapping
One of the biggest issues with writing the Metabuffer emulator was
mapping certain machines to specific MPI processes. Each COutFrame com-
ponent needed to be running on a specific cluster machine that was connected
to a specific display. Otherwise the tiling could not work.
Computation clusters consisting of PC workstations are seldom equipped
with graphics cards and even more rarely are they connected to graphics dis-
plays. Usually each machine is the same as any other. MPI assumes this and
does not offer any way to bind certain machines to certain processes.
In order to overcome this, during initialization the Metabuffer emula-
tor performs an all-to-all broadcast of each MPI process’ machine name. With
this information, each process in the Metabuffer emulator dynamically deter-
mines its role. Processes that are connected to displays automatically use the
COutFrame code and assume the correct position in the tiling. Processes that
are not connected to displays use the CInFrameBus code and also calculate on
which machines the displays are located in order to send their image fragments.
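The following sketch shows one way this role discovery could look. The MPI
calls are standard, but IsDisplayHost() is a placeholder for the emulator's
actual check against the tiled-display host list.

    // Illustrative role discovery: gather every process' machine name, then
    // decide locally whether this process drives a display.
    #include <mpi.h>
    #include <cstdio>

    int main(int argc, char** argv) {
        MPI_Init(&argc, &argv);
        int rank = 0, size = 0;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        char name[MPI_MAX_PROCESSOR_NAME] = {0};
        int len = 0;
        MPI_Get_processor_name(name, &len);

        // All-to-all broadcast of machine names.
        char* all =
            new char[static_cast<size_t>(size) * MPI_MAX_PROCESSOR_NAME];
        MPI_Allgather(name, MPI_MAX_PROCESSOR_NAME, MPI_CHAR,
                      all, MPI_MAX_PROCESSOR_NAME, MPI_CHAR, MPI_COMM_WORLD);

        // bool isDisplay = IsDisplayHost(name); // COutFrame role if true
        std::printf("rank %d runs on %s\n", rank, name);

        delete[] all;
        MPI_Finalize();
        return 0;
    }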
5.2.3 Plugin API
Programming the emulator consists of writing just three functions, which are
then linked into the existing code; a hypothetical sketch of their signatures
follows the list below.
InitRenderer(): This function is called at the initialization of the emulator.
It passes to the user code the renderer number (0 to NUMINPUTS-1),
an MPI communicator containing all the renderers in the system (for use
in load balancing operations), and the argc and argv parameters passed
in from the mpirun command line.
GetRendererData(): This function is called at the start of every frame.
The location and resolution of the viewport are requested, along with the
RGB and Z data contained in the renderer's viewport.
UpdateRenderer(): Since the MPICH implementation does not currently
support multithreaded processes, this function is called multiple times
during the image compositing to allow user code to process any message
queues or do other housekeeping tasks. If not needed it can be set to an
empty function.
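A hypothetical plugin.h sketch is shown below. The dissertation names the
three entry points, but the parameter types here are reconstructed from their
descriptions and may differ from the real distribution.

    /* Hypothetical plugin.h sketch; types are reconstructed, not verbatim. */
    #ifndef PLUGIN_H
    #define PLUGIN_H

    #include <mpi.h>

    // Called once at startup: this renderer's index (0 to NUMINPUTS-1), a
    // communicator over all renderers (for load balancing), and the mpirun
    // command line arguments.
    void InitRenderer(int renderer, MPI_Comm renderers, int argc, char** argv);

    // Called at the start of every frame: report the viewport location and
    // resolution and hand back the viewport's RGB and Z data.
    void GetRendererData(int* x, int* y, int* width, int* height,
                         unsigned char** rgb, float** z);

    // Called repeatedly during compositing so single-threaded user code can
    // drain message queues or do other housekeeping; may be empty.
    void UpdateRenderer();

    #endif // PLUGIN_H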
The two multiresolution techniques discussed in this dissertation were
both coded as plugins for the Metabuffer emulator. The advantage of splitting
the application code from the emulation code with this strict API is that these
Metabuffer emulator plugins can then very easily be made into applications
that interact with the actual Metabuffer hardware. The only requirement
would be to replace the Metabuffer emulator code with the code required to
interact with the hardware. The very same plugin API could still be used.
5.3 Distribution
Although the code in this distribution has been tested only on Linux
clusters, it should be portable to just about any OS. The emulator relies heav-
ily on the MPI, GLUT, TIFF, and OCview libraries, all of which have been
compiled for many different operating systems. The actual Metabuffer emula-
tor code should be very cross platform.
Again, for reference, the cluster here at UT consists of 32 Linux ma-
chines equipped with Hercules 3D Prophet II GTS graphics cards. 10 of these
machines are connected to a 5 x 2 tiled projection display in the visualization
laboratory. The other 22 graphics cards are used only for rendering.
The Metabuffer emulator uses MPI to communicate between the ma-
chines and OpenGL to work with the graphics cards. A typical session has
10 of the machines rendering polygons and 10 others doing the Z depth com-
positing and ultimately displaying the graphics on the projectors.
The software should run on any Linux based cluster that has some
version of MPI (practically standard on most computing clusters) and runs
XWindows with support for OpenGL. Appendix B includes more details on
creating the emulator executable.
5.3.1 Plugins
Four plugins are in this distribution. They are located in the meta/plugin
directory. To change plugins, simply copy them to the meta/emu directory
and change their name to plugin.cpp. The distribution initially has teapot.cpp
as the plugin.cpp file.
1. teapot.cpp The famous Utah teapot bounces around the tiled display.
This plugin is the simplest because it does not use the OCview rendering
library and doesn’t need the metadata part of the distribution.
2. ducksetal.cpp Similar to the teapot, but instead of teapots, small
OCview objects move around the screen.
3. progressive.cpp This plugin is the progressive image composition ex-
ample. A 9.2 million triangle isosurface extraction of the visible human
data set is split into 10 pieces of 920,000 triangles. Each piece is ren-
dered by a different machine and then composited together to form the
entire image. The pieces are first cycled in a circle to show they are
individual and can move anywhere in the display, then they are put to-
gether, zoomed, and rotated. The resolutions of the parts change if their
triangles cannot fit within high-resolution viewports. This way no poly-
gon or pixel information needs to be communicated between machines
and frame rates remain constant. The plugin does not contain code to
rebalance the triangles in order to regain high-resolution viewports for
different views. Editing the plugin.cpp source code and changing the
DATASET #define allows either the VIZHUMAN, SANTABARBARA,
or OCEAN data sets to be viewed.
4. fovea.cpp For the foveated vision plugin, the renderers are assigned ar-
eas of the screen according to where the user is currently gazing. The
majority of renderers draw the region where the user is focused. At the
same time, the minority of renderers concentrate on drawing the periph-
ery. This smaller number of processors can render the larger area because
they are working in low resolution. Since human peripheral vision lacks
detail there is no reason to render this area with as much acuity as where
the user is focused. Using the same argument, these renderers also deal
with decimated data sets to reduce their polygon counts to manageable
sizes. Again, the periphery is not sensitive to this loss of detail. The
result of this is that the user is presented with a high resolution region of
interest and a constant frame rate, no matter what viewpoint is chosen.
By editing the plugin source and changing the DATASET #define the
user can view either the VIZHUMAN, SKELETON, or ENGINE data
sets.
Writing a custom plugin simply means creating the three functions
specified by plugin.h in a plugin.cpp file and linking it in. A plugin does not
have to use the GLUT library or OCview. It can use anything that will provide
a source of RGB and Z information.
5.3.2 Future Work
Unlike the Metabuffer hardware simulator, the Metabuffer emulator does not
currently support supersampling. This means that neither antialiased
supersampled viewports nor screen door transparency are possible.
5.3.3 Undocumented Features
This emulator is constantly evolving and there are several features buried in
the code that might be useful for other developers. In CInFrameBus.cpp the
#define SHOWVIEWPORT turns on or off black rectangles that mark the viewport
locations on the output displays. In COutFrame.cpp the #define SAVEOUTIMAGE
will save the output image that the machine is showing on the tiled display
wall in that machine's /tmp directory. Collecting images from all the output
machines and running them through metapaste.c in the meta/tools directory
will combine them into a single image, which could then be made into an AVI.
Likewise, in the plugins, the #define SAVEFRAME will save the rendered
viewport into the /tmp directory. The plugins also support a stand-alone mode
if make -f Makefile.sa is used. This allows an individual rendering machine
to run sans MPI and display its viewport on the local display. This can be
useful for debugging.
5.4 Conclusion
The emulator presented in this chapter allows for the development of
full featured applications for the Metabuffer architecture. While it cannot
approach the performance of the Metabuffer hardware, the software emulator
gives good enough speed to allow interactive testing of Metabuffer applications.
Once written for the Metabuffer emulator, applications can easily be
ported to work with the Metabuffer hardware. At most, a simple library
should be all that is needed to abstract the interaction with the video card
frame buffer to that of the Metabuffer emulator plugin API.
Chapter 6
Greedy Viewport Allocation Algorithm
6.1 Introduction
Given a triangular mesh, it is very important to distribute the triangles
properly in order to achieve a good load balance among parallel rendering
servers. A parallel system is only as fast as its slowest member, so ensuring
that the work is evenly distributed is paramount to obtaining good timings
and therefore good speedups and processor utilization efficiency.
This chapter explores the problem of load balanced triangular mesh
partitioning for the rendering servers of the Metabuffer. The goal is to
distribute the triangles in a mesh in such a way that every triangle is
rendered by at least one server, the rendering loads are evenly balanced, and
each grouping of triangles is located within a screen sized area in the
overall display space.
The last issue is very important in order to create a fully high resolution
display, since the renderer graphics cards are limited to a screen's worth of
output image data. Only if each group of triangles can fit completely within
the frame buffer of each graphics card can a completely high resolution display
be composited together. The multiresolution capability of the Metabuffer will
be exploited later in chapter 8 to provide time-critical progressive rendering
with constant frame rates while the user is aggressively panning and zooming
the scene.
6.2 Background
6.2.1 Sort First Algorithms
Samanta [44] discusses several partition algorithms for the SHRIMP
sort-first system. As a sort first system, these algorithms attempt to find
nonoverlapping regions of screen space that can be distributed among proces-
sors so that each machine will have an equal rendering load. The algorithms
used by SHRIMP are grid bucket, grid union, and kd-split.
Grid Bucket
In the SHRIMP implementation of the grid bucket algorithm, the entire
screen space is divided up into squares. Groups of squares are then assigned
to renderers in an evenly balanced way in order to load balance the rendering
work. A heuristic is used to estimate the costs associated with having a par-
ticular square rendered by a particular machine. In the case of SHRIMP, these
costs can be significant, since pixels must be transferred for every square that
is not rendered on the machine driving that square's display. Using the polygon
distribution and these statistics, the squares are divided evenly.
Grid Union
The grid union algorithm tries to improve on one of the main defi-
ciencies of the grid bucket algorithm as relating to the SHRIMP sort first
architecture. Dividing the screen space up into small squares and then assign-
ing those squares to different rendering machines means that many polygons
located on the edges of the squares will have to be rendered twice. To prevent
this, the grid union algorithm attempts to merge adjoining squares on the
same renderer. Thus, there will be fewer polygon overlap penalties.
KD-Split
The kd-split algorithm avoids the overhead of partitioning the screen
space into many very small squares and instead recursively partitions it,
first in one dimension and then in the other. For example, for a given screenful of
polygons, the algorithm determines where in the display a vertical line would
divide the image evenly in terms of polygon rendering time. The amount of
rendering work on the left would be equal to the amount of rendering work on
the right. Next, two horizontal lines evenly divide each of the evenly divided
halves. This is done successively until the screen space is partitioned into the
correct number of tiles needed for the number of renderers.
The kd-split minimizes the amount of polygon overlap due to the fewer
number of partitions. However, keeping the rendering workload local to the
display machine is problematic for the SHRIMP system. The kd-split algo-
rithm also has the effect of generating partitions of varying sizes. In some
cases the partitions could be bigger than the rendering capabilities of the
graphics cards used on the machines, necessitating further subdivision. Still,
the kd-split algorithm usually performed the best in the testing presented in
Samanta’s paper.
6.2.2 Sort Last Techniques
As a sort-last image composition system, the Metabuffer has a few more
freedoms than SHRIMP. First, it allows overlapping images
rendered by different processors. This provides more flexibility in assigning
rendering processors to image space. It eliminates the polygon overlap over-
head that SHRIMP encounters when it needs to render the same polygon twice
on adjoining regions belonging to different machines. Second, the fact that
Metabuffer viewports can be located anywhere on the overall display space
means that the pixel redistribution overhead seen in SHRIMP is also gone.
However, as a result of its architecture the Metabuffer has a constraint
that does not severely affect the SHRIMP system. As stated previously, in order
to obtain a high resolution display, and to obtain results comparable to that
of SHRIMP, every viewport must be the size of a single display tile. The use
of multiresolution can temporarily avoid this constraint, but it is a necessary
requirement to get a high resolution output. The SHRIMP system also must
abide by this constraint, but it can subdivide large regions and render them
separately if needed, while the Metabuffer does not have this option.
The additional freedoms and the additional constraint imposed by the
Metabuffer means that the polygon assignment algorithms for the SHRIMP
system are not applicable for the Metabuffer architecture. Instead of dividing
the screen space into regions of varying size as is the case with SHRIMP, what
is really needed is an algorithm that fully covers the polygons with squares
(viewports) of constant size (resolution).
Shifting Strategy
The conditions and constraints for this problem are analogous to the
covering with squares problem. The covering with squares problem can be
stated as follows: Given n points on a grid, find the smallest set of squares s
of a certain size covering all those points.
The Metabuffer viewport algorithm differs slightly from the covering
with squares algorithm. Instead of a minimal number of s squares, the Meta-
buffer case requires a constant number of v viewports, where v is the number
of viewports (renderers) available, and typically v >> s. Also, each Metabuf-
fer viewport must cover an equal number of polygons, where in the covering
with squares algorithm it is only required that the union of the squares cover
all the points. Still, given the solution to the covering with squares prob-
lem, it should be straightforward to determine the solution to the Metabuffer
viewport problem.
Because the covering with squares problem is strongly NP-Complete,
research has concentrated on finding algorithms that give approximate
solutions. Hochbaum [22] presents a bounded error approximation algorithm to
solve this problem using a shifting strategy. Likewise Gandhi [18] shows a
shifting strategy solution for a partial covering variant of this problem.
The basic concept of the shifting strategy is divide and conquer. Instead
of dealing with the entire screen space and finding the optimal covering using
brute force, the screen is divided into smaller parts with a smaller search space.
Even though the solution determined from these smaller search spaces may not
be optimal, it can be proven that it is optimal within a bounded error amount.
In the algorithm, each dimension is treated individually. The display space I
is divided into strips of width D. The size of the search space is determined
by l, the number of contiguous strips that are used in the search. If l
contiguous strips are used, there are l different ways to group the strips
into lD widths (in essence, shifting over by D each time). All l of these
groupings are searched in a brute force manner to determine the best covering.
The smallest result of the l outcomes is used for the final answer. This is
repeated for each dimension.
Hochbaum shows that such an algorithm runs in O(l^d n^(2ld+1)) time, where l
is the number of contiguous strips considered, n is the number of points, and
d is the dimension.
6.3 Implementation
Even though the shifting strategy gives a bounded error approximation
to the covering with squares problem, it is too slow for dealing with the large
numbers of polygons in the Metabuffer viewport problem. From the order
analysis, with two dimensions and the smallest and most error prone l of one,
the algorithm is still O(n5). Given a large data set where n could be millions
of polygons, the time required to compute the covering would be quite large.
Because of this, a much simplier greedy algorithm is presented in this
dissertation. While it cannot guarantee a bounded error answer, it does run
in O(nlog(n)) average time. In fact, for large numbers of polygons, the most
computation time is required by the quicksort, which although is O(n2) in the
worst case is typically closer to O(nlog(n)) for an average run. The greedy
algorithm for load balancing a mesh into a set number of viewports for the
Metabuffer is described formally as follows.
Conditions
1. There exists a screen of n tiles and m rendering servers (m ≥ n). Each
tile has the same size of w × h pixels and each server has the same
rendering capability of c triangles per second.
2. There are p triangles that project into the screen. We assume each
triangle takes the same amount of time to render.
Constraints
1. To compare fairly with the results of [44], the viewport size of each
server should be the same as the size of a tile, w × h. However these
viewports can overlap each other, which differs from the sort-first approach
of [44].
2. Every triangle should be covered by the union of viewports, and rendered
by at least one server. If parts of a triangle are rendered by different
servers, the triangle is counted multiple times.
3. A triangle can only be rendered by servers whose viewports cover the
triangle. In other words, there is no communication of pixels between
different servers.
The goal is to find the best placement of the viewports and assignment of
triangles to viewports such that the triangles are rendered in the shortest
time.
Lemma 1. If the total number of triangles is p and each rendering server can
render c triangles per second, the best possible time is p/(m × c). The worst
case time is p/((m − n + 1) × c). It occurs when almost all triangles project
to a single tile but a few triangles are scattered across other tiles such
that n viewports are necessary to cover all triangles.

Any other case can be rendered in p/((m − n + 1) × c) time, because there are
at most (m − n) viewports that have more than p/(m − n + 1) triangles after
we cover the whole display with n viewports. Those extra triangles can be
assigned to the remaining m − n viewports such that each rendering server has
no more than p/(m − n + 1) triangles to render.
The steps of the proposed greedy viewport allocation algorithm are as follows
(a condensed code sketch appears after the list):
1. The center of mass of the triangles is found by taking a weighted average
of the two dimensional coordinates of the projected bounding box for each
triangle.
2. The triangles are sorted by the distance to the center of mass.
3. In order of decreasing distance, each triangle is assigned to a viewport.
If no viewport can cover the triangle, a new viewport is created. If multiple
viewports can cover the triangle, the triangle is assigned to the viewport
with the least mobility (i.e., the viewport whose previous triangle
assignments allow it to be moved the least). If a viewport has a triangle
count a predefined percentage higher than the optimum average polygon load,
it is closed to additional triangle assignment.
4. A final series of passes is made over the triangle list, during which
viewports with a higher than average number of triangles attempt to reassign
their triangles to viewports with lower than average counts.
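The condensed sketch below illustrates steps 1 through 3. The names and types
are illustrative; viewport mobility is simplified to a bounding-box slack
test, the closing rule is modeled by a simple per-viewport triangle limit,
and the rebalancing passes of step 4 are omitted.

    // Condensed, illustrative sketch of steps 1-3 of the greedy algorithm.
    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct Tri { float x, y; };  // projected triangle center
    struct Vp  { float minx, miny, maxx, maxy; int count; };

    // Can a w x h viewport still slide so that it covers both its current
    // triangles and the point (x, y)?
    bool CanCover(const Vp& v, float x, float y, float w, float h) {
        return std::max(v.maxx, x) - std::min(v.minx, x) <= w &&
               std::max(v.maxy, y) - std::min(v.miny, y) <= h;
    }

    void Assign(std::vector<Tri>& tris, std::vector<Vp>& vps,
                float w, float h, int maxPerVp) {
        // Step 1: center of mass (tris assumed non-empty).
        float cx = 0, cy = 0;
        for (const Tri& t : tris) { cx += t.x; cy += t.y; }
        cx /= tris.size(); cy /= tris.size();
        // Step 2: sort by decreasing distance from the center of mass.
        std::sort(tris.begin(), tris.end(), [&](const Tri& a, const Tri& b) {
            return std::hypot(a.x - cx, a.y - cy) >
                   std::hypot(b.x - cx, b.y - cy);
        });
        // Step 3: assign each triangle to the least mobile open viewport,
        // creating a new viewport when none can cover it.
        for (const Tri& t : tris) {
            Vp* best = 0;
            float bestSlack = 1e30f;
            for (Vp& v : vps) {
                if (v.count >= maxPerVp || !CanCover(v, t.x, t.y, w, h))
                    continue;
                float slack = (w - (v.maxx - v.minx)) +
                              (h - (v.maxy - v.miny));
                if (slack < bestSlack) { bestSlack = slack; best = &v; }
            }
            if (!best) {
                vps.push_back(Vp{t.x, t.y, t.x, t.y, 0});
                best = &vps.back();
            }
            best->minx = std::min(best->minx, t.x);
            best->miny = std::min(best->miny, t.y);
            best->maxx = std::max(best->maxx, t.x);
            best->maxy = std::max(best->maxy, t.y);
            best->count++;
        }
    }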
Figure 6.1: Viewport configuration for horse example.

The strategy this algorithm employs is to assign the far-flung triangles
first, since the area in the center of the image is most likely to be covered
by a high number of viewports while the edge region is less probable to have
many choices in coverage. The algorithm also attempts to maintain the highest
degree of mobility for each viewport for as long as possible. This means that
there will be more viewport choices for triangles later in the assignment chain.
The algorithm requires a sort of all triangles in the scene, which provides a
lower bound on its complexity. Using a large number of renderers (and thus a
large number of viewports) the algorithm is O(pm).
6.4 Results
Figure 6.1 shows how the triangles of a horse model are divided into
eight viewports in an eight renderer Metabuffer configuration. The rectangles
show the computed viewports, which have the same size but can be positioned
arbitrarily in the image space. The colors of the triangles indicate the
viewport to which each has been assigned. The total number of triangles in
the horse model is 22,258. The number of triangles in each viewport varies
from 2,782 to 2,783.
The load balancing algorithm took 0.051 seconds to process the horse
model: 0.008 seconds to compute the center of mass and distances, 0.019
seconds to sort the triangles, and 0.024 seconds to assign the viewports. The
concentric circles show the distance from the computed center of mass and
the rectangles show the computed viewports. It is obvious from the radiating
triangle assignments that the algorithm depends heavily on the distance from
the center of mass. Obviously the algorithm in this case was helped by the
compactness of the horse and the relatively even distribution of triangles,
although the high level of detail in the head tested its abilities.
Figure 6.2 plots the timings of the horse model and several other data
sets of varying size to demonstrate the complexity of the algorithm. For the
horse model, the assignment of triangles to the viewports took the longest
time. However, as seen by the graph, as the triangle count increases for the
larger models, the sort time overshadows all other parts of the algorithm due
to its greater complexity. Using the quicksort, the sort time curve is
O(p log p). This contrasts with the viewport assignment time curve, which is
O(mp) and therefore is linear if the number of triangles grows and the number
of processors
[Plot omitted: greedy viewport assignment algorithm timings in seconds versus
number of polygons (up to 1.2e+07), with curves for "Assign", "Distance",
"Sort", and "Total".]
Figure 6.2: Greedy algorithm timings for various model sizes
is kept constant. Likewise, the distance calculation is simply O(p) and is
linear as well. The total of these three parts of the algorithm primarily
reflects the contribution of the sort time, resulting in an O(p log p)
algorithm when p >> m.
6.5 Conclusion
The greedy algorithm presented in this chapter gives fast viewport as-
signments that consist of evenly balanced triangle counts. Using the method
presented here of assigning far flung triangles first and attempting to maximize
the mobility of existing viewports, viewport assignments are able to cover all
of the triangles evenly, while at the same time limiting the spatial area they
are required to render in order to increase the resolution for that particular
viewport.
This algorithm will be used extensively for the progressive image com-
position plugin presented in this dissertation. It will be used to initially assign
the triangles to the viewports in a load balanced manner while giving a com-
pletely high resolution display for the initial viewpoint.
Chapter 7
Wireless Visualization Control Device
7.1 Introduction
When using very large, multiscreen, tiled displays in conjunction with
the visualization of large data sets, it is important for the user (or users) to
be able to interact easily with the application. In the case of the Metabuffer
[4] project, this is especially true since the aim of using its multiresolution
capabilities is to increase user responsiveness. The Metabuffer currently has
two different multiresolution plugins, each of which requires an easy, portable
user interface.
The first plugin, progressive image composition, uses multiresolution
in order to hold frame rates steady regardless of changing user viewpoints.
It also uses polygon redistribution in order to create high-resolution displays
when the user pauses to analyze key areas of the data set. Programming in
predetermined routes for the data set to be manipulated, while demonstrating
the technique, does not show how the plugin would respond in the real world
to random user input. By tying the plugin to a user interface, the user is free
to stress the plugin by changing views, zooming in or out, or simply navigating
through the data set. With this real world interactivity, it is more apparent
how steady frame rates increase the responsiveness a user experiences, and
thus the value of the progressive image composition technique and of
multiresolution in general.
The second plugin, foveated vision, tracks the gaze of one or more
users and renders those areas in high resolution. Areas in the periphery of
the user’s view are rendered in low resolution. Therefore, it is necessary to
have a user interface that allows the tracking of multiple users’ gazes at once.
Again, preprogrammed gazes, while demonstrating the technique, do not show
the advantages that higher frame rates have for responsiveness in real world
navigation and thus the value of the foveated vision technique and the use of
multiresolution.
To create such a mobile user interface, standard COTS Windows CE
Pocket PC devices were selected. Equipped with wireless Ethernet PCMCIA
cards, they are lightweight, small, relatively inexpensive, and user friendly.
While eventually gaze trackers on headsets will provide foveated vision infor-
mation, in the meantime the wireless Pocket PC devices serve in this role.
Using mobile computing in conjunction with the Metabuffer opens up many
new possibilities for user interactivity. The next section details the current
state of research in using mobile devices for user input. After that, this paper
discusses the design of the Metabuffer mobile system and implementation
results to date.
7.2 Background
Historically, research in using mobile computing for user interfaces can
be divided into three main areas: ubiquitous computing, augmented reality,
and context aware applications. In many cases research projects fulfill the
requirements of more than one area, especially as COTS mobile computing
devices have become more powerful.
7.2.1 Ubiquitous Computing
Ubiquitous computing essentially means bringing the concept of com-
puting out of the computer room and into everyday lives. Instead of doing
work with a computer while sitting in front of a monitor, people go about their
daily activities with computers integrating seamlessly into the environment.
The term was coined by Weiser [51]. Ironically for the Metabuffer
project, Weiser considers that “ubiquitous computing is roughly the opposite
of virtual reality.” To him, virtual reality involves putting the computer at the
center whereas ubiquitous computing should revolve around the real world.
Of course, in the case of the Metabuffer, wireless mobile devices are being
integrated into a virtual reality environment.
While the definition may not match the Metabuffer’s application, con-
cepts of ubiquitous computing research certainly do. At Xerox’s PARC lab,
wireless mobile devices called tabs were employed to keep track of roaming
employees and allow those employees to remotely set temperature, light, and
humidity levels in different rooms. This is analogous to the types of data the
wireless devices would provide as input for Metabuffer visualization applica-
tions.
7.2.2 Augmented Reality
With augmented reality, virtual reality is used, but only to supplement
the information in the real world. Usually this is through head mounted dis-
plays in conjunction with other computers worn on the user. When the user
walks around his or her environment, the computers display additional infor-
mation over the real world scenes that inform the user about state, structure,
or other attributes.
One type of augmented reality that uses handheld devices is called sit-
uated information spaces [14]. By tracking the location of the user, a handheld
can specify information that would be relevant to the user’s needs or task. For
example, if the user was next to a movie theater, the handheld could display
show times and ticket availability.
This idea could be used in Metabuffer visualization applications. By
knowing where the user is looking, the handheld could display additional infor-
mation. While examining a galaxy data set, for example, tracking the user’s
gaze at a certain star or celestial feature could reveal data about that object
on the handheld leaving the actual display uncluttered for other users to view.
7.2.3 Context-Aware Applications
Context-aware applications use information about the user’s location
in order to provide data at the right place at the right time. Essentially this
allows the user to roam freely with applications coming on-line customized to
his or her needs no matter the user’s locale.
Lamming [29] introduces the concept of memory prostheses. By record-
ing information pertinent to a user’s surroundings, this information can be
recollected in a similar circumstance in the future and thus provide the user
with an appropriate set of recalled information.
This concept can be applied to the Metabuffer visualization application
by allowing the wireless input devices to store information about the user’s
navigation patterns as they relate to individual data sets. In this manner,
users can set bookmarks of views in the data set and come back to those views
in the future. They may also be able to take notes on certain areas of the data
set. This information would be stored on the wireless unit independent of any
other user that happens to be viewing the data set and could be recalled at a
future time.
7.3 Implementation
The design of the Metabuffer system is relatively simple from a hard-
ware standpoint. Recent advances in wireless handheld technology have re-
duced what used to be a complicated technical undertaking to just plugging
in a collection of COTS components.
The main piece of this puzzle is the Compaq iPAQ Pocket PC device.
This device runs the Windows CE operating system from Microsoft. The
Windows CE operating system is essentially the standard Win32 API with su-
perfluous parts removed to save space. For example, Windows CE is entirely
Unicode based. Therefore, all ASCII routines have been removed. Although
restrictions like these mean that code has to be written specifically for Win-
dows CE devices, most Windows programmers have little trouble adapting to
the new operating system. A big advantage for Windows CE programmers is
that Microsoft provides the Windows CE development environment and SDKs
free of charge in order to encourage growth of applications for the operating
system.
For wireless connectivity, an Orinoco RG-1000 residential gateway is
employed along with Lucent wireless Ethernet cards. The wireless Ethernet
cards plug into the iPAQs by means of a PCMCIA adapter. They are then con-
figured to talk to the RG-1000 which is connected to the Metabuffer cluster’s
LAN. From this point, communicating over the network is seamless.
The user interface application for the iPAQ is written in standard C
using the Windows CE API. It provides a way to manipulate the orientation
and zoom of the data set being examined, along with a means to provide gaze
information by clicking on a representation of the tiled screen space.
Figure 7.1: Wireless visualization device user interface
Figure 7.1 shows an actual screen shot of the user interface. The cube
in the screen shot can be rotated by using the iPAQ's stylus. This provides the
orientation of the model. At the top of the shot is a representation of the tiled
display wall. The longhorn icon is placed where the user is gazing, again via
the stylus. At the bottom of the shot is a slider bar which controls the zoom.
The orientation and gaze information received from the graphical UI is
transmitted over the wireless Ethernet as UDP packets to a server residing on
the land based host cluster. This server collects the information from all the
wireless devices and stores the current state of all of them locally.
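As a rough illustration of the sending side, consider the Winsock sketch
below. The packet layout, names, and the per-call socket setup are
assumptions for illustration only; the text does not specify the actual
wire format.

#include <winsock2.h>
#include <string.h>

/* Hypothetical state packet; the real field layout is not given in
   the text, so this struct is illustrative only. */
typedef struct {
    float rotX, rotY, rotZ;   /* model orientation from the stylus  */
    float gazeX, gazeY;       /* gaze position on the tiled display */
    float zoom;               /* slider bar value                   */
} StatePacket;

/* Send the current UI state as one UDP packet to the cluster. */
void send_state(const char *serverIp, const StatePacket *p)
{
    WSADATA wsa;
    SOCKET s;
    struct sockaddr_in addr;

    WSAStartup(MAKEWORD(2, 2), &wsa);
    s = socket(AF_INET, SOCK_DGRAM, IPPROTO_UDP);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(6666);                 /* forwarded port  */
    addr.sin_addr.s_addr = inet_addr(serverIp);  /* Prism's address */

    sendto(s, (const char *)p, sizeof(*p), 0,
           (struct sockaddr *)&addr, sizeof(addr));
    closesocket(s);
    WSACleanup();
}

In a real interface the socket would of course be created once and
reused. UDP is a sensible choice here because only the most recent state
matters, so a lost packet is harmless.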
At each frame, the Metabuffer application queries the server about the
status of the wireless users. This is done via a named pipes mechanism. The
server was separated from the Metabuffer application because the Metabuffer
emulator uses MPI as its basis. Currently Prism’s version of MPICH does
not support multithreading. Therefore running it as a separate process allows
the Metabuffer to run unencumbered. The individual process model will also
make it easier for other applications to have access to the same data.
Figure 7.2 is an overview of how the entire process works for a Meta-
buffer frame. First, the Windows CE iPAQ device collects information from
the user through its graphical interface. This information is then sent as UDP
packets by wireless Ethernet card to the Orinoco gateway antenna. The gate-
way relays the UDP packets to Prism, which is the firewall for the visualization
cluster. Prism forwards this particular port (currently port 6666) to the
Alpha1 machine located behind the firewall.
[Figure: iPAQ, via wireless Ethernet, to the RG-1000 gateway, to Prism, to the listener and ccvpipe on Alpha1, whose MPI process feeds the MPI processes on Alpha11, Alpha12, Alpha13, etc.]
Figure 7.2: Wireless visualization operation
Running on the Alpha1 machine
is a custom UDP server application called “listener”. This server application
collects UDP packets being sent by the iPAQs and saves the most recent in-
formation. At each frame, the MPI process bound to the Alpha1 machine
queries a named pipe located on Alpha1’s file system called “ccvpipe”. When
this happens a separate thread from “listener” writes the current iPAQ data
to the named pipe. The Alpha1 process receives the data and broadcasts it
over MPI to all of the rendering machines.
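To make this flow concrete, below is a minimal sketch of a listener-style
server. It is not the actual listener.c; the buffer size, packet handling,
and threading details are assumptions based only on the description above.

#include <pthread.h>
#include <string.h>
#include <fcntl.h>
#include <unistd.h>
#include <netinet/in.h>
#include <sys/socket.h>

#define PORT      6666                    /* forwarded from Prism     */
#define PIPE_PATH "/home2/wjb/ccvpipe"    /* created with mknod ... p */

static char            latest[512];       /* most recent iPAQ packet  */
static int             latest_len = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

/* Thread: each time a reader (the MPI process) opens the named pipe,
   write the current iPAQ data into it. */
static void *pipe_writer(void *arg)
{
    for (;;) {
        int fd = open(PIPE_PATH, O_WRONLY);   /* blocks until a reader */
        if (fd < 0) break;
        pthread_mutex_lock(&lock);
        write(fd, latest, latest_len);
        pthread_mutex_unlock(&lock);
        close(fd);
    }
    return 0;
}

int main(void)
{
    pthread_t tid;
    struct sockaddr_in addr;
    int sock = socket(AF_INET, SOCK_DGRAM, 0);

    memset(&addr, 0, sizeof(addr));
    addr.sin_family      = AF_INET;
    addr.sin_port        = htons(PORT);
    addr.sin_addr.s_addr = INADDR_ANY;
    bind(sock, (struct sockaddr *)&addr, sizeof(addr));

    pthread_create(&tid, 0, pipe_writer, 0);

    /* Main loop: collect UDP packets, keeping only the newest state. */
    for (;;) {
        char buf[512];
        int n = recvfrom(sock, buf, sizeof(buf), 0, 0, 0);
        if (n <= 0) continue;
        pthread_mutex_lock(&lock);
        memcpy(latest, buf, n);
        latest_len = n;
        pthread_mutex_unlock(&lock);
    }
}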
7.4 Distribution
The source code distribution for the wireless interface is contained in
the file ccv.zip. It includes four subdirectories:
1. Linux: This directory contains the source code for listener.c, the UDP
server, as well as test.c, which simply requests information from the named
pipe. It also has a readme document describing how to set up the UDP server
on Prism.
2. Source: This directory has the source code for the actual user interface
application.
3. Win32: This directory contains the projects that will build the user in-
terface for a Windows desktop or laptop machine using Microsoft Visual
Studio. It isn’t hard to make Windows CE programs cross platform with
their desktop counterparts, and allowing it to run on a desktop machine
facilitates testing.
4. WinCE: This directory contains the Windows CE projects that will build
the user interface application for the iPAQ using Microsoft Embedded
Visual Tools.
Most people are familiar with compiling programs for the Windows
desktop environment using Visual Studio. Configuring a programming envi-
ronment for Windows CE is just as simple. Microsoft currently provides the
Embedded Visual Tools system and Windows CE SDKs for free on their web
site [35].
After downloading and installing the Embedded Visual Tools software,
building a Windows CE application is just like building a Visual Studio
application. The only difference is selecting the processor type of the
Windows CE device (in the iPAQ's case it is a StrongARM) and the platform
(in the case of the iPAQ it is PocketPC). Embedded Tools will compile and
link the code and then send it to the device automatically if it is currently
synced in its cradle.
Once the iPAQ is configured with the user interface code, it is time
to ready the cluster to receive the iPAQ’s UDP transmissions. First, create
the named pipe on Alpha1 (or whatever machine is assigned to receive the
packets):
mknod /home2/wjb/ccvpipe p
Next, ensure UDP packets are passed from Prism to the local machine
(in this case Alpha1 which has the IP address of 192.168.128.97) by logging
on to Prism and giving the following port forwarding command for port 6666:
ipmasqadm autofw -A -r udp 6666 6666 -h 192.168.128.97
Then, run the listener server on the local machine (Alpha1) to receive
the UDP packets and write to the named pipe:
listener
Configure the Metabuffer MPI process on Alpha1 to read the named
pipe to get information on positioning. Simply edit the enviro.h file to set
the WIRELESS #define to 1, the WIRELESSSERVER #define to the local
machine (Alpha1), and the WIRELESSPIPE #define to the full path of the
pipe just created.
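Assuming the pipe created above, the relevant lines of enviro.h would then
look something like this (the values shown are illustrative):

/* enviro.h -- wireless interface settings; values are illustrative */
#define WIRELESS       1                       /* enable the interface  */
#define WIRELESSSERVER "alpha1"                /* machine with the pipe */
#define WIRELESSPIPE   "/home2/wjb/ccvpipe"    /* pipe made with mknod  */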
If the plugin is written correctly, this process can then send the posi-
tioning information using MPI to the other machines so that everyone is syn-
chronized. Currently both the Progressive Image Composition and Foveated
Vision plugins support the wireless user interface.
A common problem that may be seen with the wireless interface is that
the Metabuffer emulator may seem to stall. This usually happens when the
listener UDP server is not running and therefore no data is being fed into the
named pipe. The MPI process waits on the empty named pipe and grinds to
a halt. If this happens, check that listener is running before starting the
Metabuffer emulator and that it is located on the correct machine.
7.5 Conclusion
Wireless devices have intriguing possibilities as a user interface
medium. In the future, ideas can be taken from past research in ubiquitous
computing, augmented reality, and context-aware applications to provide addi-
tional data on the handhelds. In combination with previous mobile computing
ideas and techniques, using wireless devices to control visualization applica-
tions should result in a more powerful interface for the user.
Chapter 8
Progressive Image Composition Plugin
8.1 Introduction
Progressivity is a user interface technique well understood by most
computer users. Perhaps the most obvious use of progressivity is in World
Wide Web browsing. When a user navigates through a website, typically the
largest images are downloaded in stages. First the image arrives quickly in low
resolution form. As time allows, more data is then downloaded from the server
in order to create a high resolution image. Because of this, the user is able to
quickly navigate the site using the low resolution images as aids to find the
page he or she is trying to find. Once the user arrives at that page, the high
resolution images are downloaded while the user is studying the information.
This technique allows the user to be unimpeded while navigating the site, but
still provides high resolution imagery where and when it matters most.
The problem of how to quickly navigate but still retain high quality
image output also exists for rendering large data sets in parallel on multiple
displays. To achieve good user interactivity, an application must guarantee
time-critical rendering of the massive data stream. However, for the instance
of displaying a triangular mesh, though a good load balanced partition among
the parallel machines can be computed for a given user view point, new compu-
tation and data shuffling are required whenever the view point is significantly
changed. Either triangles may fall out of the viewport because of the move-
ment of the viewing direction or the viewport cannot cover all the polygons
assigned to it because of zooming. Redistributing primitives or imagelets in
order to render all of the polygons correctly takes time. If the user is simply
navigating the data set, this additional time will result in slower frame
rates, hampering user interactivity.
To solve this problem, we propose adapting the concept of progressivity
to the generation of images via image compositing on the Metabuffer, terming
the technique progressive image composition. The Metabuffer is a parallel,
multidisplay, multiresolution image compositing system [4]. To test the tech-
nique we are using the software emulator of the Metabuffer architecture [5].
By employing the Metabuffer’s multi-resolution feature, it is possible
to ensure the user will always have constant frame rates no matter what the
viewing angle or zoom factor. Instead of redistributing polygons or imagelets
while the user is rapidly changing views, a viewport can instead go to a lower
resolution and enlarge in order to accommodate the current polygons assigned
locally to the machine. When the user finally arrives at the view of interest and
stops changing viewpoints, frame rate is no longer a concern. At this point
polygons are redistributed in order to once again form completely high
resolution viewports. This paper shows that progressive image composition
helps to provide a good balance between user interactivity and frame rates
on the one hand and image quality on the other.
8.2 Background
The technique of progressivity has been studied by many research
groups for many different applications. Progressive transmission is used to
send information through a network, as with the case of the World Wide Web
for example. Progressive refinement is used for rendering images. Images may
first be created coarsely and then over time improved. Progressive image com-
position relies on a combination of both techniques in order to improve frame
rates.
8.2.1 Progressive Transmission
With the growth of the Internet, there has come a need for ways to
transmit large quantities of graphical information in varying levels of band-
width while still retaining a high degree of user interactivity. Because this
bandwidth can range from a slow analog modem up to a high speed fiber
optic connection, designing a web site to satisfy this requirement is difficult.
By using progressive transmission to regulate the bitstream of the data, it is
possible to satisfy both the slowest and fastest end user.
Shapiro [48] tells how wavelets can be used in image compression in
order to generate different bitstream rates. The essential idea is that the most
significant bits of the image are distributed first. This gives the end user a
basic idea of what is to come without downloading the entire picture. Over
time as the less significant bits are received, the image is refined.
Progressive image composition on the Metabuffer shares many of the
characteristics of progressive transmission. Time is the utmost concern in
progressive transmission. The user should be able to at least see a glimpse of
the output in the smallest amount of time by using a coarse representation.
Similarly, in progressive image composition, the goal is to hold frame rates
constant by using lower resolution (and thus lower bandwidth) versions of the
imagery to avoid lags due to network communication. This results in high user
interactivity even in the case of large data sets and relatively low rendering
resources.
8.2.2 Progressive Refinement
Progressive refinement is often used in radiosity in order to improve the
appearance of images over time. The more spare cycles that are available on
the machine, the more iterations can be spent adding to the detail of the final
picture. Forrest [16] shows how such an approach can improve antialiasing
results.
The Metabuffer’s use of progressive image composition is similar to how
progressive refinement has been used in preceding research. Imagery is first
computed in the quickest time possible, but over time computation can take
place to improve the quality of the final output. The Metabuffer achieves this
improvement by moving geometry primitives between the rendering machines
whenever the user keeps the view stationary in order to fit the primitives into
high resolution viewports. This is analogous to the computation that takes
place in a raytracing or radiosity application to further define the final image.
8.3 Implementation
There are three main steps in progressive image composition. First
the data set must be partitioned evenly across all of the parallel rendering
machines. In order to render very large data streams there cannot be a global
data set replicated on each machine. Second, for each frame the viewport
resolution and location must
be determined for each renderer. These rendered viewports are ultimately
composited by the Metabuffer and sent to the tiled display. Third, machines
are constantly determining how they can best adapt their viewports to the
current viewpoint and zoom factor by exchanging data in the background in
order to shrink the area covered by each renderer in image space and thus
create higher resolution imagery. These three steps are described below:
8.3.1 Initial Triangle Assignment
For the start of the visualization, the data set is distributed evenly
among all the rendering machines dependent upon the initial viewing param-
eters. The viewing parameters are important because the triangle partitions
assigned to each rendering server optimally should fit within a single high
resolution viewport. If the number of rendering servers is equal to the number
of displays and the resolution of the highest resolution viewport is equal to
the resolution of the tiles in the display this will always be possible.
Samanta [44] gives a variety of algorithms for solving this issue for a
sort first image compositing system. However, the Metabuffer, which is a
sort last image compositing network, allows more freedom in assigning image
space since viewports are allowed to overlap. In order to take advantage of
this additional flexibility and to obtain the best triangle distribution, a greedy
algorithm is currently used [5].
The greedy algorithm creates viewports by assigning the furthest tri-
angles from the center of mass first while attempting to retain viewport mo-
bility. In this context mobility means that assigning an additional triangle
for coverage to an already existing viewport will not limit its ability to shift
to accommodate additional triangles. By using this metric, the algorithm is
guaranteed to cover far flung polygons while still allowing for the best possible
load balancing of the bulk of triangles.
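A compact sketch of this greedy assignment is given below. The mobility
test is simplified to a single question (does the viewport still fit within
a high resolution extent after absorbing the triangle?); the actual
algorithm of Chapter 6 and [5] is more involved, and all names here are
illustrative.

#include <float.h>

typedef struct { float x, y; } Pt;            /* projected triangle centroid */
typedef struct {
    float minx, miny, maxx, maxy;             /* current extent              */
    int   count;                              /* triangles assigned          */
} Viewport;

/* Simplified mobility test: does the viewport still fit inside a high
   resolution w x h window after absorbing point p? */
static int fits(const Viewport *v, Pt p, float w, float h)
{
    float x0 = v->minx < p.x ? v->minx : p.x;
    float y0 = v->miny < p.y ? v->miny : p.y;
    float x1 = v->maxx > p.x ? v->maxx : p.x;
    float y1 = v->maxy > p.y ? v->maxy : p.y;
    return (x1 - x0) <= w && (y1 - y0) <= h;
}

/* tris[] must already be sorted by decreasing distance from the data
   set's center of mass.  Each viewport starts empty (minx = miny =
   FLT_MAX, maxx = maxy = -FLT_MAX).  owner[i] gets the viewport index. */
void greedy_assign(const Pt *tris, int n, Viewport *vp, int nvp,
                   float w, float h, int *owner)
{
    for (int i = 0; i < n; i++) {
        int best = -1;
        for (int j = 0; j < nvp; j++)    /* least loaded viewport that fits */
            if (fits(&vp[j], tris[i], w, h) &&
                (best < 0 || vp[j].count < vp[best].count))
                best = j;
        if (best < 0) best = 0;          /* none fits: forces lower resolution */
        Viewport *v = &vp[best];
        if (tris[i].x < v->minx) v->minx = tris[i].x;
        if (tris[i].y < v->miny) v->miny = tris[i].y;
        if (tris[i].x > v->maxx) v->maxx = tris[i].x;
        if (tris[i].y > v->maxy) v->maxy = tris[i].y;
        v->count++;
        owner[i] = best;
    }
}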
8.3.2 Viewport and Resolution Determination
With the triangles assigned, the next problem is how to determine
what parameters to use in order to make the individual images created by
the renderers blend with the rest of the composited display. This primarily
involves computing the viewport size and location for each renderer relative
to the overall viewing space.
To do this, the bounding box that OpenGL will use to rotate and
translate the renderer’s portion of the data set is set to the bounding box of
the data set as a whole. The coordinates of the bounding box for the renderer’s
portion of the data set are then computed in relation to this overall bounding
box. Thus, the bounding box of the triangles assigned to the renderer will be a
subset of the bounding box for the entire data set. By following the corners of
this subbounding box around the display space, the location of the renderer’s
polygons can be precisely tracked and measured.
At the beginning of each frame, a viewing frustum is created for the
entire tiled display using glFrustum(). The projection matrix obtained from
this call won’t be used to actually create an image. Rather, it will be used
to calculate the screen coordinates of the subbounding box corners for the
renderer’s portion of the data set. The modelview matrix is also obtained
after the proper rotations and translations of the object have occurred. Given
these two matrices, it is easy to determine where the eight corners of the subset
bounding box for the renderer’s portion of the triangles would lie on the overall
display space.
\begin{pmatrix} x_e \\ y_e \\ z_e \\ w_e \end{pmatrix} = M \begin{pmatrix} x_o \\ y_o \\ z_o \\ w_o \end{pmatrix} \qquad (8.1)
As shown in equation 8.1, each bounding box’s corner object coordinate
is first multiplied by the modelview matrix to correctly rotate, translate, and
scale it in order to compute the eye coordinate.
\begin{pmatrix} x_c \\ y_c \\ z_c \\ w_c \end{pmatrix} = P \begin{pmatrix} x_e \\ y_e \\ z_e \\ w_e \end{pmatrix} \qquad (8.2)
The eye coordinate is then multiplied by the overall projection matrix
in equation 8.2. This obtains the corner’s clip coordinate in the display space.
\begin{pmatrix} x_d \\ y_d \\ z_d \end{pmatrix} = \begin{pmatrix} x_c/w_c \\ y_c/w_c \\ z_c/w_c \end{pmatrix} \qquad (8.3)
Equation 8.3 then normalizes this into the device coordinate. A simple
scaling of the device coordinate yields the exact locations of the subbounding
box corners.
With the overall display coordinates in hand, the minimum and maxi-
mum x and y coordinates are found from the eight corners. These values are
then clipped by the boundaries of the overall display space. The extent of
the final x and y coordinates of the subbounding box determine the size and
therefore resolution of the viewport that will be needed. The values of the x
and y coordinates also determine the position of the viewport in the display
space.
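In practice, gluProject() performs exactly the chain of equations 8.1
through 8.3: the modelview multiply, the projection multiply, the
perspective divide, and the final scaling into display coordinates. A
sketch of the corner tracking using it follows; the names are illustrative,
and the viewport array is assumed to describe the overall tiled display
space rather than a single window.

#include <GL/gl.h>
#include <GL/glu.h>

/* Project the eight corners of a renderer's subbounding box into
   overall display coordinates and return their 2D extent. */
void subbox_extent(const GLdouble corners[8][3],
                   int display_w, int display_h,
                   double *xmin, double *ymin,
                   double *xmax, double *ymax)
{
    GLdouble model[16], proj[16];
    GLint view[4] = { 0, 0, display_w, display_h };  /* whole tiled wall */

    glGetDoublev(GL_MODELVIEW_MATRIX, model);   /* M in equation 8.1 */
    glGetDoublev(GL_PROJECTION_MATRIX, proj);   /* P in equation 8.2 */

    *xmin = *ymin =  1e30;
    *xmax = *ymax = -1e30;
    for (int i = 0; i < 8; i++) {
        GLdouble wx, wy, wz;
        gluProject(corners[i][0], corners[i][1], corners[i][2],
                   model, proj, view, &wx, &wy, &wz);
        if (wx < *xmin) *xmin = wx;
        if (wx > *xmax) *xmax = wx;
        if (wy < *ymin) *ymin = wy;
        if (wy > *ymax) *ymax = wy;
    }
    /* The caller then clips this extent to the display boundaries, as
       described above, to size and place the viewport. */
}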
Rendering the data given the viewport size and location usually requires
setting up an asymmetrical frustum. Though the user will be looking at the
entire display as a whole, a renderer will only be creating a subset of that
display. Thus, the frustum for this renderer must originate at the eye, but will
be off center depending on the location of the viewport and not perpendicular
to the projection plane (except in the case of a perfectly centered viewport in
the overall display of course). Figure 8.1 shows the projection issue for the
progressive image composition plugin. While the centerline of the overall view
is perpendicular to the projection plane, the centerline of the viewport view
is not. Creating a symmetric frustum for the viewport view will yield an
inaccurate rendering of that portion of the scene.
The issue of asymmetrical frustums is most often encountered when
rendering stereo images. Because our two eyes are slightly off center, yet both
look at the exact same area, the frustum has to be slightly off center and not
perpendicular to the projection plane in order to get the correct projection for
[Figure: the user's eye point with a symmetric frustum for the overall view and an asymmetric frustum for the viewport view, both meeting the display (projection plane)]
Figure 8.1: Asymmetrical frustum illustration
a stereo image. If this is not taken into account, “toe in” will result which
essentially distorts the resulting three dimensional effect. In the case of image
compositing, ignoring this projection problem will cause the
disparate images not to align correctly when composited.
Because most OpenGL implementations support stereo rendering, it
is very easy to establish an asymmetrical frustum. By taking the extents of
the viewport screen locations and mapping those back to the original frustum
values of the overall display space, it is possible to correctly determine the
frustum needed for each viewport.
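A sketch of that mapping is shown below, assuming the viewport extent is
given as fractions of the overall display; the names are illustrative.

#include <GL/gl.h>

/* Build the asymmetric frustum for one renderer's viewport.
   (l, r, b, t, n, f) are the glFrustum() parameters of the overall
   display; (vx0, vy0, vx1, vy1) is the viewport extent in normalized
   display coordinates, 0..1. */
void viewport_frustum(double l, double r, double b, double t,
                      double n, double f,
                      double vx0, double vy0, double vx1, double vy1)
{
    glMatrixMode(GL_PROJECTION);
    glLoadIdentity();
    /* Linearly map the viewport extent back onto the near plane of the
       overall frustum; the result is off center, i.e. asymmetric. */
    glFrustum(l + vx0 * (r - l), l + vx1 * (r - l),
              b + vy0 * (t - b), b + vy1 * (t - b),
              n, f);
    glMatrixMode(GL_MODELVIEW);
}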
8.3.3 Data Exchange
Over time, after the user has navigated the data set, it is very likely
that many of the Metabuffer viewports will need to shift to lower resolutions
in order to accommodate all of the triangles for which they are responsible.
Some kind of process, therefore, is needed to redistribute the triangles in order
to regain a collection of viewports that are all in the highest resolution for the
given viewpoint.
In order to do this, we propose a method based on progressive refine-
ment where excess cycles are used in the background to continually redistribute
polygons and shrink viewport sizes. While the user navigates, viewports may
need to shift to lower resolutions because all of their triangles do not fit within
the high resolution clipping area. However, the user will see no reduction in
frame rate because the partitions are still evenly load balanced and no com-
munication has had to take place. When the user sees something interesting
in the scene and starts to study it, the servers finally have time to redistribute
the blocks among themselves to try to reduce the size of the viewports and
increase resolution and therefore image detail.
To have a reasonable granularity for reshuffling triangles, we may break
the bounding box of the surface into hierarchical boxes [2], called blocks, and use the
block as the unit of distribution. At the heart of the scheme is a central server
that manages the allocation of those blocks. The central server keeps track of
block assignments to rendering servers and, given the view that the user has
chosen, determines which blocks, when transferred between two processors, will
help reduce viewport sizes for that particular view while preserving the load
balance property. It tells the rendering servers connected to the Metabuffer
which blocks to render and which blocks to send to other rendering servers.
The rendering servers themselves store only the blocks of triangles that
they are currently assigned. If the central server tells a rendering server to
ship a block, that server sends the block directly to the other server over
the network. The longer a user looks at a particular view, the more blocks
that can be transferred over the network between the servers. In essence,
the speed of the network used in the cluster affects only the speed of the
progressive improvements in resolution and not the speed that the user can
navigate through the data set.
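Since the scheme is only proposed here, the following is a speculative
sketch of the decision the central server might make on each background
step: choose the block whose transfer most shrinks the combined screen
extent of donor and recipient. The cost model and all names are
assumptions, not an actual implementation.

#define NSERVERS 10

typedef struct {
    float minx, maxx, miny, maxy;   /* screen extent for current view */
    int   owner;                    /* rendering server holding it    */
} Block;

/* Screen-space area of server s's viewport, pretending block 'skip'
   has left and block 'add' has arrived (-1 means "none"). */
static float extent(const Block *b, int nb, int s, int skip, int add)
{
    float x0 = 1e30f, x1 = -1e30f, y0 = 1e30f, y1 = -1e30f;
    for (int i = 0; i < nb; i++) {
        int mine = (b[i].owner == s && i != skip) || i == add;
        if (!mine) continue;
        if (b[i].minx < x0) x0 = b[i].minx;
        if (b[i].maxx > x1) x1 = b[i].maxx;
        if (b[i].miny < y0) y0 = b[i].miny;
        if (b[i].maxy > y1) y1 = b[i].maxy;
    }
    return (x1 > x0 && y1 > y0) ? (x1 - x0) * (y1 - y0) : 0.0f;
}

/* One background step: pick the single block transfer that most
   shrinks the combined viewport area of donor and recipient.  Load
   balance checks on per-server triangle counts are elided.  Returns
   the block to ship, or -1 if no transfer helps. */
int pick_transfer(const Block *b, int nb, int *from, int *to)
{
    int best = -1;
    float best_gain = 0.0f;
    for (int i = 0; i < nb; i++) {
        int s = b[i].owner;
        for (int d = 0; d < NSERVERS; d++) {
            if (d == s) continue;
            float before = extent(b, nb, s, -1, -1) + extent(b, nb, d, -1, -1);
            float after  = extent(b, nb, s,  i, -1) + extent(b, nb, d, -1,  i);
            if (before - after > best_gain) {
                best_gain = before - after;
                best = i; *from = s; *to = d;
            }
        }
    }
    return best;
}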
8.4 Results
The configuration used to test the progressive image composition plu-
gin consisted of 19 machines in our visualization cluster. Each machine was
equipped with a high performance Hercules Prophet II graphics card, 256 MB
of RAM, an 800 MHz Pentium III processor and ran the Linux operating sys-
tem. Nine of the machines were set to actually emulate the Metabuffer hardware.
They performed the image compositing and output of the 3 by 3 tiled display
space. The other 10 machines were tasked with actually rendering the scenes.
All 19 machines were connected via 100 Mbps Fast Ethernet. We lim-
ited the test to 19 machines instead of the full 32 in the cluster with graphics
cards because the higher amounts of data transfer exceeded the capabilities
of the network and significantly slowed emulator performance. We anticipate
that the addition of Compaq’s ServerNet II to the cluster will greatly reduce
this constraint. The actual Metabuffer design, when put into hardware form,
eliminates this overhead entirely.
Dataset          Size (triangles)   Viewport assignment   Render per frame
Oceanographic    392,332            6.6 seconds           0.03 seconds
Santa Barbara    6,163,390          88.6 seconds          0.44 seconds
Visible Human    9,128,798          135.36 seconds        0.78 seconds
Table 8.1: Progressive data set information
Three data sets of different sizes were used to demonstrate the perfor-
mance of the progressive image composition plugin for the Metabuffer emu-
lator. Table 8.1 gives the size of each data set, the time it took to initially
precompute the triangle to viewport assignments using the greedy algorithm,
and finally the average time needed to render each frame of the 720 frame
movies presented in this report. As will be shown in the following graphs, the
per frame timings are constant irrespective of the user’s viewpoint, so these
averages essentially tell the frame rate for the entire movie.
Also, even though the data sets used were of increasing size, the number
of rendering machines was kept constant at 10. This means that in the case
of the visible human, frame rates were slower than what would be needed for
a real time display. Including more machines as renderers would reduce the
workload of each machine and lower the rendering time to real time 30 frames
per second rates. The only penalty imposed by the Metabuffer hardware for
scaling up to more renderers is a few pixels of latency per machine with no
drop in throughput.
8.4.1 Oceanographic
The oceanographic data set is an isosurface generated by Zhang [53]
consisting of 392,332 triangles. It shows the topography of the ocean floor.
Dividing the data set into 10 load balanced viewports yielded 39,233 triangles
per renderer.
To demonstrate that the frame rates do not change regardless of the
user’s viewpoint a 720 frame movie was generated in which the data set was
zoomed in and zoomed out while constantly being rotated. A sample of the
frames taken throughout the movie is included in figure 8.2.
At the beginning of the movie, the image is cleaved into the 9 tiles that
form the 3 by 3 tiled display space. During the movie, these tiles are rejoined
to show the overall display, and then separated again at the end to reinforce
the fact that the Metabuffer is acting on a multitiled display space.
The black boxes visible in the frames show the viewport locations. As
the data set is zoomed in and out, it is readily apparent when the viewports
shift from high resolution to low resolution by the sizes of these black boxes.
[Figure: frames 3, 79, 155, 235, 360, 422, 461, 531, 605, and 707 of the oceanographic movie]
Figure 8.2: Sample frames from the oceanographic movie
Initially, the individual viewports belonging to each renderer are cycled
around in a circle to demonstrate that they can be located anywhere within the
global display space and are indeed disparate. Each viewport is color coded
according to the renderer that drew it.
Later, the viewports are composited together to form the data set.
The user zooms in while rotating the scene. As this is occurring, viewports
dynamically move and resize themselves to adjust to the expanding extent
they must cover to render all of their triangles. Finally, the user zooms out
and the viewports shrink.
[Plot: per-frame rendering time in seconds versus frame number (0 to 720) for Renderer0 through Renderer9]
Figure 8.3: Rendering times for oceanographic movie frames
Figure 8.3 gives the timings for the oceanographic movie
throughout all 720 frames. For comparison with the other data sets they are
scaled from 0 to 0.85 seconds. Note that the timings for each frame are almost
completely flat. No communication has to occur between frames, and this lack
of overhead means that the user sees no drop in interactivity regardless of how
the data set is viewed. From the graph, it is evident that the renderers are all
reasonably load balanced.
8.4.2 Santa Barbara
The Santa Barbara data set is an isosurface taken of the gravity fields
for a galaxy. This data set is almost 16 times larger than the oceanographic
one shown previously.
Figure 8.4 shows some sample frames from the 720 frame movie. Just
as with the oceanographic example, the viewports are first circled to show that
they are distinct. Afterwards the data set is zoomed in and zoomed out while
constantly being rotated. Again, each viewport is color coded according to
the renderer that drew it.
The graph in figure 8.5 reveals timing results similar to that of the
oceanographic example. Again, they are flat, owing to the lack of interframe
communication needs. The viewports are also relatively well load balanced
resulting in efficient use of all 10 renderers.
[Figure: frames 3, 89, 160, 240, 325, 362, 474, 546, 617, and 715 of the Santa Barbara movie]
Figure 8.4: Sample frames from the Santa Barbara movie
[Plot: per-frame rendering time in seconds versus frame number (0 to 720) for Renderer0 through Renderer9]
Figure 8.5: Rendering times of Santa Barbara movie frames
8.4.3 Visible Human
The final sample data set is an isosurface taken from the visible human
model. This data set is more than 23 times larger than the oceanographic
example.
Figure 8.6 reveals sample frames from the 720 frame movie. The view-
ports circle and are then composited together to form the overall display. Again
the data set is zoomed in and zoomed out while being constantly rotated. As
with the other two examples, each viewport is color coded according to the
renderer that drew it.
Figure 8.7 shows the timings of the movie. Just as with the previous
two, they are flat resulting in constant frame rates for the user and good
interactivity. However, these timings range from 0.43 seconds all the way to
0.78 seconds. The viewports that were created all had an equal number of
polygons assigned to them. But, in some cases, the number of polygons is not
an accurate representation of the rendering load. It is obvious that in this
particular case, some other metric will need to be used to load balance the
data set evenly.
All of the frames for these movies were created for a 3 by 3 display to
facilitate an easier presentation of them for this article. In reality the cluster
hosting the Metabuffer is connected to a 5 by 2 tiled display space in our
visualization laboratory. Typically 10 machines are used to do the Metabuffer
emulation, each responsible for driving one of the displays. The composited
visible human is pictured in figure 8.8 from our visualization laboratory during
an emulator run.
[Figure: frames 2, 82, 160, 236, 316, 365, 431, 543, 646, and 702 of the visible human movie]
Figure 8.6: Sample frames from the visible human movie
[Plot: per-frame rendering time in seconds versus frame number (0 to 720) for Renderer0 through Renderer9]
Figure 8.7: Rendering times for visible human movie frames
Figure 8.8: Composited visible human in visualization lab
8.5 Conclusion
Because of the pipelined design of the Metabuffer, more machines than
the 10 used here in these experiments could be harnessed. The only penalty
would be an increase in latency, and this increase would be measured in pixels–
a small tradeoff. Given the resources, there is no limit to how many machines
can be added and thus how many times the triangles can be divided into
smaller and smaller viewports. A target of 30 frames per second with a large
cluster is feasible for data sets of the sizes presented in this report.
In the future, we will explore methods to redistribute the polygons to
shrink viewport sizes and increase resolution interactively based on the client
server framework illustrated in this report. This processing can take place in
the background while the user is studying a scene. We do not anticipate that
this data movement will affect frame timings in any way. Rather, it will simply
increase the resolution of the scene via progressive refinement.
The application of progressive image composition with the Metabuffer
shows how the Metabuffer architecture can assist in improving load balancing
and user interactivity while still achieving high quality output images. Pro-
gressive display is a common feature in many computing applications (most
notably web browsing) and has been accepted by users as an adequate way to
present data in order to gain interactivity. The ability to provide fast frame
rates for data set navigation while still allowing for high-resolution output im-
ages at arbitrary viewpoints provides a good balance between speed and image
quality.
Chapter 9
Foveated Vision Plugin
9.1 Introduction
Our own eyes can only sense detail directly where we look. Objects in
our peripheral vision appear in low resolution and lack definition. This basic
biological fact is a result of the concentration of rods and cones in the retina
of the human eye. A higher concentration exists at the center with the density
gradually becoming lower and lower towards the edges. In fact, the human
eye even has a blind spot where nerves exit the eyeball and there are no rods
or cones at all. Our brain processes what we are seeing in order to account
for the blind spot and differing rod and cone densities. As a consequence
of human biology, even though computer visualization systems may render a
large display in high resolution, by the time that information gets to our brain,
much of the information has been lost by the limitations of our vision system.
Because visualization displays and data sets are becoming larger, this
fact has important consequences. Already cave type virtual reality labs employ
multiple projectors for an immense immersive display. By tiling the higher
resolution projectors or panels available today, creating enormous displays
with billions of pixels is practical. IBM, for example, currently has a 3000 by
3000 pixel LCD panel consisting of 9 million pixels. Creating an 11 by 11 grid
of those panels would result in a display consisting of over one billion pixels.
Rendering such displays in high resolution to visualize extremely large data
sets uses a tremendous amount of computing resources, takes a large amount of
time, and thus results in slow frame rates. This is despite the fact that, because
of our limited vision systems, much of the display either won't be seen at all
(because it is behind us in a cave arrangement) or will be seen only in the
periphery in low resolution.
This is the concept for the foveated vision application for the Metabuf-
fer. The Metabuffer is a parallel, multidisplay, multiresolution image composit-
ing system [4]. Using the physical characteristics of the eye as an advantage,
the computing resources of the Metabuffer are matched to the areas in the
display that are being examined. The majority of the rendering servers con-
centrate their work where the user is gazing. In this manner a high resolution
image is generated quickly exactly where the user is focused. The periphery
of the display is rendered in lower and lower levels of resolution and detail
corresponding to the rod/cone concentration in the human eye. This allows
only a few renderers to be used to create the entire periphery of what could be
a building-sized display. To test the procedure we are using the software em-
ulator [5] of the Metabuffer architecture. This paper shows that the foveated
vision technique on a parallel, multidisplay, multiresolution image composi-
tion system concentrates rendering power where it is needed, helping to lower
computation cost and resulting in high frame rates and good user interactivity.
9.2 Background
Using the foveation of the human visual system as an advantage is
nothing new. Several research groups have tackled problems such as image
transmission and image processing by using the low resolution areas of the eye
as an asset.
9.2.1 Image Processing
One problem that benefits from foveated techniques is image processing.
Image processing is often a very computationally intensive task. Every pixel in
an image must have calculations performed on it to perform pattern matching,
edge detection, or other operations.
Many times, though, this image processing is being done to simulate
what a normal human eye would be seeing. Facial recognition is one such
example. The human eye lacks detail in its peripheral view. Therefore, the
brain does not have to process nearly as much information from the edges of
the view as it does in the center. This hindrance actually helps the brain by
preventing an overload of visual stimulus.
Researchers have taken advantage of this fact by using methods to avoid
processing the enormous quantities of high resolution pixels in the periphery.
After all, since the brain does not have to deal with these peripheral pixels,
neither should the computer. Special foveated CCD cameras have been de-
veloped which record high resolution only at the center of the gaze in order
to lessen the information overload resulting from taking in imagery from high
resolution cameras which sense all areas equally [13, 52, 41]. Image processing
applications can then take advantage of this reduced imagery to concentrate
their algorithms on the center of the scene, rather than the edges of the gaze,
just as the brain does in conjunction with the human eye.
9.2.2 Image Transmission
Another problem which has used foveated vision is image transmis-
sion. Full motion video can require large amounts of data to be transmitted.
Usually the amount of bandwidth available is the limiting factor facing this
transmission. Any technique that lessens the need for data will greatly help
the image transmission problem. Since the peripheral vision of the human eye
cannot see high resolution imagery, it makes little sense to have to transmit
this peripheral image data that eventually will not even be processed by our
vision system.
This is the technique used by Geisler [19]. His research applies foveated
techniques to MPEG encoding. Essentially the MPEG stream is recorded
at successive levels of resolution. By recording the user’s gaze, a “foveated
pyramid” is created with high resolution imagery in the center which becomes
successively lower the farther the imagery happens to be from the user’s gaze.
Geisler reports that with foveated techniques the MPEG bandwidth
requirements dropped by a factor of three. He also states that if bandwidth is
kept constant, frame rates could instead increase by a factor of three. Finally
Geisler remarks that foveated techniques could easily be applied to image
generation by using low resolution and reduced levels of detail. These very
techniques will be exploited in the Metabuffer plugin.
9.2.3 Image Generation
Although not specifically tied to applications involving eye tracking,
several research groups have studied using multiresolution to speed up image
generation. Hoppe [23] illustrates how progressive meshes can be used to
significantly increase performance when rendering large data sets. He shows
how different levels of detail can be used depending on whether the data is
close to or far away from the user. Shamir [47] reveals how to use DAGs in order to
efficiently create multiresolution meshes on time varying deforming meshes.
Magillo [32] presents a library in order to model multiresolution meshes. Saito
[43] discusses how to use wavelets to compactly encode and efficiently retrieve
hierarchical multiresolution representations of objects.
Progressive meshes will be used by the Metabuffer foveated vision ap-
plication to present level of detail views of the scene to the user based on gaze
location. Currently, the progressive meshes used by the Metabuffer foveated
vision application do not use wavelet compression, but this method could serve
to compress source data further to better handle large data sets.
9.3 Implementation
[Plot: "Visual Acuity Across the Retina"; relative acuity (0 to 1.0) versus degrees from the fovea, with the blind spot marked]
Figure 9.1: Coren’s acuity graph
Acuity is the term used to describe the eye’s ability to resolve detail.
Typically, this measurement is expressed as an angle corresponding to the
smallest span the eye can identify. As shown in figure 9.1 by Coren [9], acuity
changes as a function of the distance away from the center of the eye. This
is due to the concentration of rods and cones in the retina. The highest
concentration exists at the center of the eye in the fovea, with the density
becoming less and less towards the periphery. A blind spot exists where the
optic nerve exits the eyeball.
Coren’s graph reveals that the drop off in acuity, and thus resolution, in
the eye is exponential. In fact, within 10 degrees it drops by almost 80 percent.
By matching the rendering resources of the computer graphics system to this
acuity graph, the rendering power of the system can be concentrated mainly
in the areas where it is needed most–the center of the user’s gaze. Only a
small portion of the system is needed to generate the low level of detail and
resolution towards the periphery.
A foveated vision system can be designed using Coren’s graph either
via the continuous method or the discrete method. The discrete method using
the hardware capabilities of the Metabuffer will be covered in this paper.
9.3.1 Continuous Method
With the continuous method, level of detail and resolution is matched
directly to Coren’s acuity graph. By using a wavelet encoded mesh, it is
possible to finely adjust the complexity of the scene. Depending upon the
distance from the center of the user’s view, an error value corresponding to
Coren’s graph can be used to walk through the wavelet encoding in order to
obtain the proper amount of detail for every area in the scene. Likewise, this
same error value can be used to adjust the level of resolution used to generate
the scene. Higher error values would allow lower levels of resolution. A very
similar method employing hierarchical bounding boxes [2] could also be used.
In either case, delays resulting from data locality issues could be quelled by
utilizing progressivity. As with progressive image composition [3], switching
to lower resolution viewports would allow renderers to cover all the polygons
they are responsible for drawing while still keeping the frame rate high. Over
time polygons can dynamically be moved to achieve the high resolution output
imagery.
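As a toy model of the continuous method, the exponential falloff can be
approximated and used directly as the error tolerance. The constant below
is fitted only to the figure quoted earlier (acuity dropping by almost 80
percent within 10 degrees), so it is an assumption rather than Coren's
measured curve, and all names are illustrative.

#include <math.h>

/* Toy acuity model: exponential falloff with eccentricity, fitted so
   acuity drops to about 20% at 10 degrees from the fovea.
   k = ln(5)/10, approximately 0.161. */
double relative_acuity(double degrees_from_fovea)
{
    return exp(-0.161 * fabs(degrees_from_fovea));
}

/* Error tolerance for the wavelet-encoded mesh: the lower the acuity
   at a sample's eccentricity, the larger the geometric error allowed.
   max_error is an application-chosen bound (illustrative). */
double error_tolerance(double degrees_from_fovea, double max_error)
{
    return max_error * (1.0 - relative_acuity(degrees_from_fovea));
}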
Of course, other metrics besides Coren’s acuity graph could be used
to direct the resolution and level of detail. In these cases, the foveated vision
system dealing strictly with user gazes can instead be generalized into a region
of interest (ROI) application. This region could be controlled via user input
from a wireless mouse or other input device instead of merely being taken from
gaze tracking hardware. The region of interest could also be modified by past
user history–keeping previous areas of interest in focus. Another characteristic
that could modify resolution and level of detail is prominent features in the
data set. Algorithms could detect high frequency changes in the data set and
bring those areas into closer focus since they could yield interesting informa-
tion. Distance from the user is also a trait that could be used to influence the
level of detail in a scene such as is done in Hoppe’s work [23].
9.3.2 Discrete Method
Applying the discrete method to the Metabuffer hardware makes sense
since the Metabuffer is able to generate viewports only in integer increments
of different resolution. Because of this limitation, instead of using Coren’s
complete graph as a cue for the level of resolution, individual points on
that graph are taken for each Metabuffer viewport resolution multiple. These
individual points are used to precompute a hierarchical mesh of the model to
be used in generating the scene.
For example, in the case shown in figure 9.2, the foveated vision ap-
plication using the Metabuffer employs three differently sized viewports. The
smallest viewport contains the highest resolution and is centered at the user’s
focus. This area corresponds to the peak in Coren’s acuity graph and will be
assigned the highest level of detail data set. The next larger viewport imple-
mented by the Metabuffer is in medium resolution. To find the level of detail
for this area, the highest acuity level covered by this area in Coren’s graph is
used. In this case, it would be about 20% of the detail of the high resolution
data set. Likewise, the largest and lowest resolution viewport implemented
by the Metabuffer uses a level of detail of approximately 10%, according to
Coren’s graph.
With polygon counts in the medium and low resolution viewports run-
ning 20% and 10% of the polygon counts in the high resolution viewport, it is
possible to match the greater number of rendering servers to the area of the
user’s focus. Using a cluster of rendering servers, 77% of these can be assigned
to generate the imagery for the high resolution high level of detail viewport.
Because the medium resolution viewport consists of only 20% of the polygons
as the high resolution viewport, only 15% of the machines are needed to ren-
der this area. Finally, since the lowest resolution consists of only 10% of the
polygons, only 8% of the machines are necessary to render the entire region in
a load balanced manner.
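The arithmetic behind these percentages is simply each tier's share of the
total polygon load. With relative polygon counts of 1, 0.2, and 0.1, the
total is 1.3, so

$\frac{1}{1.3} \approx 77\%, \qquad \frac{0.2}{1.3} \approx 15\%, \qquad \frac{0.1}{1.3} \approx 8\%.$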
9.3.3 Load Balancing
The main problem in creating a foveated vision plugin for the Meta-
buffer system is how best to utilize the rendering resources available. They
should be organized in order to achieve the best degree of efficiency and the
fastest frame rates. The multiple parallel rendering machines need to be load
balanced no matter what viewpoint the user chooses. This organizational
problem is presented formally as follows.
Conditions
1. There exists a screen of n tiles and m rendering servers (m ≥ n). Each
tile has the same size of w × h pixels and each server has the same
rendering capability c triangles/second.
2. There are p triangles that project into the screen. We assume each
triangle takes the same amount of time to render.
Constraints
1. A high resolution w × h pixel area must be rendered where the user(s)
are gazing at all times. Regions surrounding this area can be rendered
in diminishing level of detail and resolution corresponding to the drop
off in rod and cone concentration in the peripheral view of the eye.
2. The data set could be extremely large, and thus all p triangles along with
the varying levels of detail of this triangle set must be evenly distributed
across all the machines. There cannot be a global data set that resides
on each machine.
3. The frame rate should be at least on par with the $p/(m \times c)$ time possible
with the progressive image composition method. Taking into account the
diminished triangle count from decimated data sets, this means that the
rendering machines need to be fairly load balanced for any user viewpoint
even if the data set is almost certainly heterogeneously distributed across
the scene.
The goal is not only to find the best assignment of levels of detail of
data to renderers but also the best match of renderers to display space such
that the display is rendered in the shortest time.
In order to solve this problem, the multiresolution features of the Meta-
buffer will be used extensively. In the case of a single user, viewports are
arranged in a configuration analogous to Geisler's “foveated pyramid”. Fig-
ure 9.2 shows the “foveated pyramid” for the visible human example in this
paper. High resolution viewports are located at the center of the user's gaze.
Successively lower resolution viewports radiate out until the lowest resolution
viewport fills the entire display.
[Figure: the foveated pyramid for the visible human. High resolution: 7 renderers, 9,124,090 polygons (1,303,441 each); medium resolution: 2 renderers, 1,060,106 polygons (530,053 each); low resolution: 1 renderer, 241,988 polygons]
Figure 9.2: Foveated pyramid for visible human example
The ability to concentrate the rendering power of the Metabuffer in
the area of the user’s gaze is possible because of progressive meshes that have
been created by decimating data sets. The large low resolution viewports in
the periphery are required to render a much greater area that would normally
consist of a large amount of polygons. By using decimated data sets, however,
the quantity of polygons in this area can be much less than the number of
polygons contained in the small high resolution viewport. Therefore, a small
number of rendering servers can adequately render a larger area.
Ensuring that the rendering servers are load balanced despite the user’s
viewpoint is achieved by assigning the triangles belonging to each progressive
mesh modulo the number of processors assigned to that mesh. This means
that the polygons for the data set are evenly distributed spatially among all
the processors. No matter where the user looks, all the processors will be
responsible for an even number of polygons. This is the technique used by
PixelFlow [12] to load balance its custom hardware even when dealing with
nonhomogeneous data sets.
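A sketch of this modulo assignment, with illustrative names:

/* Assign triangle i of a progressive mesh to one of the k rendering
   servers dedicated to that mesh, round-robin.  Because consecutive
   triangles are usually spatial neighbors, every server receives a
   spatially interleaved, roughly equal share, so any viewpoint sees a
   balanced load (the PixelFlow-style distribution cited above). */
int server_for_triangle(int i, int first_server, int k)
{
    return first_server + (i % k);
}

For instance, in the visible human example later in this chapter, the high
resolution mesh might use first_server = 0 and k = 7, the medium mesh
first_server = 7 and k = 2, and the low mesh first_server = 9 and k = 1.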
9.3.4 Compositing
In order to merge the layers of multiresolution imagery together and
simulate the “foveated pyramid” using the Metabuffer, it is necessary to en-
sure that the higher level resolution imagery always takes precedence over
lower level resolution imagery. To do this, lower resolution rendering servers
remove portions of their viewports that will be covered by higher resolution
imagery using the hardware stencil buffer. With most of today’s graphics
cards, including the GeForce2 boards in our cluster, stencil tests are always
performed when doing Z buffer comparisons. Thus, the use of a stencil buffer
is essentially free in terms of performance cost. With the areas not covered by
the stencil vacant, pixels from high resolution renderers are free to be compos-
ited over these areas. This effectively performs a painter's algorithm operation
using the existing architecture of the Metabuffer.
To allow for continuity, neighboring viewports of different resolutions
are allowed to overlap slightly. In these areas of overlap, dithering patterns are
applied. Again, this is done using the stencil buffer. Checkerboard patterns
are applied at the edges of the higher resolution viewport. By pushing the far
and near clipping planes slightly farther back for the neighboring low resolution
area, the border area between the two viewports consists of half higher and
half lower resolution data, but with a checkerboard mesh that is of the higher
resolution. This screen door transparency technique effectively smooths the
output image at the transitions between the higher and lower resolutions.
Blending this area masks discontinuities in the progressive meshes and in the
resolution changes.
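Below is a sketch of the stencil setup a lower resolution renderer might
use: first mark the screen rectangle that a higher resolution viewport will
cover, then render with the stencil test rejecting those pixels so they
stay vacant for the high resolution imagery. The framebuffer must have
stencil planes, and the function and parameter names are illustrative, not
the plugin's actual code.

#include <GL/gl.h>

void mask_high_res_region(float x0, float y0, float x1, float y1)
{
    glEnable(GL_STENCIL_TEST);
    glClearStencil(0);
    glClear(GL_STENCIL_BUFFER_BIT);

    /* Write 1s into the stencil over the covered rectangle, without
       touching color or depth.  An orthographic projection matching
       screen coordinates is assumed to be current. */
    glStencilFunc(GL_ALWAYS, 1, 1);
    glStencilOp(GL_REPLACE, GL_REPLACE, GL_REPLACE);
    glColorMask(GL_FALSE, GL_FALSE, GL_FALSE, GL_FALSE);
    glDepthMask(GL_FALSE);
    glRectf(x0, y0, x1, y1);
    glColorMask(GL_TRUE, GL_TRUE, GL_TRUE, GL_TRUE);
    glDepthMask(GL_TRUE);

    /* Scene rendering now proceeds only where the stencil is 0; a
       checkerboard pattern could be stenciled along the border in the
       same way to get the screen door blending described above. */
    glStencilFunc(GL_EQUAL, 0, 1);
    glStencilOp(GL_KEEP, GL_KEEP, GL_KEEP);
}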
9.3.5 Tracking
Tracking the movement of the retina typically is done using head mounted
displays with CCD cameras aimed into the eye. Until such a system is installed
in the visualization lab, the research in this paper uses a wireless visualization
device implemented using Compaq iPAQs running Windows CE and wireless
Ethernet to allow the user to input gaze areas and rotate and zoom the model.
9.4 Results
The configuration used to test the foveated vision plugin consisted
of 19 machines in our visualization cluster. Each machine was
equipped with a high performance Hercules Prophet II graphics card, 256 MB
of RAM, an 800 MHz Pentium III processor and ran the Linux operating sys-
tem. Nine of the machines were set to actually emulate the Metabuffer hardware.
They performed the image compositing and output of the 3 by 3 tiled display
space. The other 10 machines were tasked with actually rendering the scenes.
All 19 machines were connected via 100 Mbps Fast Ethernet. We lim-
ited the test to 19 machines instead of the full 32 in the cluster with graphics
cards because the higher amounts of data transfer exceeded the capabilities
of the network and significantly slowed emulator performance. We anticipate
that the addition of Compaq’s ServerNet II to the cluster will greatly reduce
this constraint. The actual Metabuffer design, when put into hardware form,
eliminates this overhead entirely.
Three data sets are used to demonstrate the capabilities of the foveated
vision plugin for the Metabuffer: an isosurface of an engine block, a skeletal iso-
surface of the visible human, and an epidermal isosurface of the visible human.
All contain progressive meshes generated by the fast isosurface extraction
system developed by Zhang [53].
Dataset          Size (triangles)   Viewport assignment   Render per frame
Engine           617,910            N/A                   0.02 seconds
Skeleton         6,352,801          N/A                   0.57 seconds
Visible Human    9,128,798          N/A                   0.81 seconds
Table 9.1: Foveated data set information
The statistics for each are shown in table 9.1. Because we are doing
a simple even division of the data set among the processors, the time needed
to assign triangles to viewports does not apply to the foveated vision plugin.
The render timings for each data set reflect the average time needed to compute
each frame in a 720 frame movie in which the model is zoomed and rotated.
As the graphs below show, the foveated vision plugin provides constant frame
rates no matter what the viewpoint, so these average timings are in fact the
frame times for any point in the movie.
Decimated data sets coupled with variable sized viewports mean that
rendering servers can be concentrated at the user's gaze. In the example
presented in this chapter, with a 3 by 3 tiled display and 10 renderers, 7 renderers
handle the high resolution viewing area, 2 handle the next larger area, and
1 works with the lowest resolution viewport covering the entire display. The
data set with the highest level of detail is divided evenly among the 7 machines.
The middle level of detail data set is divided among the 2. Finally, the lowest
level of detail data set is given to the one machine responsible for the
extreme periphery. This even division means that large data sets can easily
be used in the Metabuffer system: the large amount of memory on the cluster
as a whole is used collectively to store the polygons.
9.4.1 Visible Human
In the case of the visible human data set, the highest resolution mesh
consists of 9,124,090 polygons. The medium resolution mesh consists of 1,060,106
polygons. Finally the lowest resolution mesh has only 241,988 polygons. Given
the processor assignments from above with the polygon counts from the pro-
gressive meshes of the visible human generated by the isosurface extraction,
the high resolution mesh is divided among 7 rendering servers resulting in
1,303,441 polygons per server. The medium resolution mesh is divided be-
tween 2 rendering servers giving 530,053 polygons per server. The low res-
olution mesh is assigned to one rendering server which is responsible for all
241,988 polygons. At first it may seem that these assignments are imbalanced,
but it is important to remember that, because the high resolution imagery will
only be drawn for one area of the display, not all of the polygons assigned to
the high resolution renderers will need to be drawn. This is true to a lesser
degree for the medium resolution polygons too. Because the polygons for all
the servers are distributed evenly across object space, different viewpoints or
zooms should not affect loading.
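A minimal sketch of this even division, assuming an illustrative Triangle type (the metascatter tool described in appendix B performs the equivalent modulo split offline):

// Triangle i goes to server i mod n, so every server's share samples the
// whole object uniformly and no viewpoint or zoom favors one server.
#include <vector>

std::vector<std::vector<Triangle> >
partitionModulo(const std::vector<Triangle> &mesh, int numServers)
{
    std::vector<std::vector<Triangle> > parts(numServers);
    for (size_t i = 0; i < mesh.size(); i++)
        parts[i % numServers].push_back(mesh[i]);
    return parts;
}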
The images in figure 9.3 show 10 of the frames from a 720 frame movie.
At the beginning and end of the movie, the nine separate screens in the tiled
display split apart to reveal the geometry of the overall scene. In the middle
of the movie they join together to show how the unified display would look.
During the movie, the visible human data set is moved through a zoom
in and zoom out while being continually rotated. Meanwhile, the user’s gaze is
being tracked and that area is rendered in high resolution no matter what the
viewpoint. The user is not restricted in where he or she may look. Anywhere
in the entire display space is a valid place for the high resolution viewport.
Polygons are color coded according to which rendering server created
them. This gives the imagery within the high resolution viewport a mottled
appearance, since 7 rendering machines are responsible for this area. The
medium resolution viewport, on the other hand, only has two colors from the
two renderers that are assigned to it. Finally, the low resolution viewport is
being rendered by only one machine and thus is a solid green.
Figure 9.3: Sample frames from the visible human movie (frames 4, 119, 255, 352, 360, 367, 452, 557, 643, 715)

Notice that the display decreases in resolution and complexity according
to the “foveated pyramid” of multiresolution viewports, which are marked as
black rectangles. The level of detail differences in the progressive meshes
and the resolution differences are most noticeable in the zoomed in views.
For example, in these views the fine detail of the lower torso of the human
inside the high resolution viewport contrasts with the less detailed rendering
of the leg in the low resolution viewport.
Figure 9.4: Rendering times for visible human movie frames (seconds per frame versus frame number, Renderer0 through Renderer9)
Timings from the movie are shown in figure 9.4. Because the polygons
are distributed evenly across the scene between the processors, all the timing
lines from the 720 frame movie are flat. No matter where the user looks or how
much he or she zooms into the scene, the load will always be the same. Because
a parallel application is only as fast as its slowest component, the frame rate
for this example using 10 rendering machines would be 0.81 seconds per frame.
However, because of the scalable nature of the Metabuffer architecture, adding
rendering machines only results in additional pixels worth of latency and does
not affect throughput. By applying 100 machines to render the same example,
each machine's share of the data set would shrink by a factor of 10, and so
would the rendering times. Still more rendering machines would yield similar
increases in frame rate.
9.4.2 Engine
For the engine data set, the highest resolution mesh consists of 617,910
polygons. The medium resolution mesh has 46,082 polygons and the lowest
resolution mesh consists of 10,728 polygons. With the processor configuration
described above, this means that the high resolution mesh is partitioned into
units of 88,273 polygons, the medium resolution mesh is divided into units of
23,041 polygons, and the low resolution mesh is assigned to one processor
responsible for all 10,728 polygons. Again, the polygon distributions are not even across
the resolution groups, but the majority of renderers (those rendering the high
resolution area) are completely balanced in terms of polygon count. Those high
resolution renderers will be the determining factor in frame timings, since they
are responsible for the largest polygon counts. Thus, the minority of renderers
should not adversely affect either the timings or the efficiency of the system.
Figure 9.5: Sample frames from the engine movie (frames 2, 78, 145, 213, 286, 360, 439, 516, 596, 677)
Figure 9.5 shows 10 of the frames from the 720 frame movie created
using the engine data set. Just as with the visible human example, the data
set is zoomed in and out while constantly being rotated. The region of interest
controlled by the user is constantly in high resolution with the periphery falling
off in detail according to Coren’s model and Geisler’s “foveated pyramid”.
Again, each viewport is color coded according to the renderer that drew it.
Figure 9.6: Rendering times for engine movie frames (seconds per frame versus frame number, Renderer0 through Renderer9)
The timings for the engine movie, set to the same scale as the visible
human movie for comparison, are shown in figure 9.6. Again, just as with
the visible human, the timings are flat no matter what viewpoint or region of
interest is chosen. In the case of the engine model, the 10 machines used in the
rendering of the frames are more than enough to create 30 frames per second.
Again, if this were not the case, the Metabuffer architecture is easily scalable
to allow for more rendering machines which will subdivide the polygon count
further and allow for faster frame times.
9.4.3 Skeleton
With the skeletal data set, the foveated vision plugin for the Metabuffer
behaves just as in the previous two examples. The skeletal data set consisted of 6,352,801
polygons in the high resolution viewport split over 7 processors resulting in
907,543 polygons per processor. For the medium resolution level of detail,
there were 664,528 polygons split over two processors giving 332,264 polygons
per processor. Finally, in the lowest resolution level of detail there were only
138,594 polygons assigned to a single machine.
Figure 9.7 shows the frames from the movie made from the skeletal
data set. Again, the model is zoomed in and zoomed out while being rotated.
The foveated area is moved around the screen, revealing a constant area of
high resolution. The rest of the display falls off in resolution as prescribed by
the “foveated pyramid”. As with the other two examples, each viewport is
color coded according to the renderer that drew it.
Figure 9.7: Sample frames from the skeleton movie (frames 3, 100, 199, 300, 341, 384, 481, 591, 685, 718)

The timings for the skeletal data set shown in figure 9.8 mirror the
results of the other two examples. All timings are flat regardless of the frame
number. The majority of renderers are balanced and grouped in the highest
timing line. The minority of renderers responsible for the medium and low
resolution areas of the screen fall in the second and third highest lines
respectively.

Figure 9.8: Rendering times for skeleton movie frames (seconds per frame versus frame number, Renderer0 through Renderer9)
One possible criticism of the technique presented in this chapter is
that not all of the rendering machines are completely load balanced. While
this is not very obvious in the case of the engine model, from the graph of
the visible human in figure 9.4 it is evident that the timing lines are clumped
into three groupings. The first, at 0.81 seconds per frame, are the 7 renderers
that are doing the high resolution viewport. The second, at 0.27 seconds per
frame, are the 2 renderers doing the medium resolution viewport. Finally, at
0.11 seconds per frame is the single renderer responsible for the low resolution
viewport. While the renderers in each of these groups are load balanced among
themselves, as a whole they are not evenly balanced.
This should not be a concern. The majority of rendering servers are
assigned to the high resolution viewport and are load balanced among them-
selves. The minority of rendering servers doing the low and medium resolution
viewports may not have as much work to do, but because of their small num-
ber they will not greatly erode the overall efficiency of the algorithm. As
long as the workload assigned to the low and medium resolution viewports by
the progressive mesh is less than the workload of the primary high resolution
rendering servers, these few low and medium resolution renderers will always
be faster than the high resolution renderers. Thus, this imbalance will not
adversely affect the overall parallel timings.
9.5 Conclusion
The flat line timings of the foveated vision algorithm presented here
provide consistent frame rates no matter what the user viewpoint. However,
these initial results should be regarded only as the worst case timings possible
with this technique. Much faster timings could be achieved with efficient
frustum culling.
Even though enough data is stored on all the machines in the system to
render each level of detail mesh in its entirety, obviously only a small portion
of those data sets is rendered in any one frame. This is because the majority
of each data set lies outside the area of its viewport for that particular
viewing angle. To avoid rendering these polygons, it is necessary to
employ a very efficient frustum culling algorithm. The frustum culler checks
polygons against the boundaries of the viewing area and eliminates extraneous
polygons from being sent to the OpenGL rendering stream. The more efficient
the algorithm, the better the speedup the foveated vision plugin will achieve.
Assarsson [1] discusses many of the methods used in fast frustum culling.
Employing efficient culling would improve the overall frame rate of the system.
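A minimal sketch of such a cull, with illustrative types and helpers standing in for the plugin's actual names (Triangle, project(), and submitToOpenGL() are assumed to exist):

// Reject any triangle whose projected bounding box misses the viewport
// rectangle before it reaches the OpenGL stream.
#include <vector>

struct Bounds2D { float xmin, ymin, xmax, ymax; };

bool overlaps(const Bounds2D &a, const Bounds2D &b)
{
    return a.xmin <= b.xmax && a.xmax >= b.xmin &&
           a.ymin <= b.ymax && a.ymax >= b.ymin;
}

void renderCulled(const std::vector<Triangle> &tris, const Bounds2D &view)
{
    for (size_t i = 0; i < tris.size(); i++)
        if (overlaps(project(tris[i]), view))   // cheap reject test
            submitToOpenGL(tris[i]);            // assumed submit helper
}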
One issue with frustum culling is the imbalance that could exist
among the different resolution viewports. For example, only machines that
hold a particular decimated data set can render polygons to the
correspondingly sized viewport. In the example, this effectively means that the
cluster of rendering servers has been split into three groups: a high resolution
group, a medium resolution group, and a low resolution group. Because
members of these groups cannot easily shift to help relieve loading pressures, in
some instances load imbalances will result. The worst case scenario
is when the user is looking at a region containing no polygons. In that case, the
majority of the rendering servers are rendering nothing. Even so, the medium
and low resolution rendering servers can at most render all the polygons
they are assigned. Since the polygon count drops off exponentially across the
resolution levels, this count is still bounded, giving a worst case frame time.
In the case of the example presented in this chapter, that upper bound would
be 0.27 seconds.
This chapter discusses foveated vision for a single user. In order to
support multiple viewers with multiple gazes, replication is necessary. Because
of the modulo distribution of polygons among the rendering servers, a single
distributed data set can only render one viewport area. Trying to render
another viewport with that same data set would leave some polygons
unavailable. To cope with this, it is necessary to have copies of each decimated
data set (except for the lowest resolution data set, which covers the entire
display) and a set of dedicated machines for each viewer. Replication
is typically not a good attribute when dealing with large data sets,
but considering that the number of users will typically be much lower than the
number of available rendering machines, this duplication does not present an
inordinate problem for memory requirements.
Chapter 10
Conclusion and Future Work
This dissertation describes the architecture for a multiresolution mul-
tidisplay image composition system. It presents a simulator and emulator
for this architecture as a testbed. Finally, it demonstrates the usefulness of
multiresolution for achieving high interactivity in parallel multidisplay image
compositing systems by providing two applications that use multiresolution
techniques to deliver constant frame rates.
10.1 Summary
High resolution imagery and fast frame rates are a tradeoff in most
visualization applications. High resolution requires more computation time,
yielding slower frame rates. Low resolution requires less computing power and
gives a faster display, but the image quality is not as good. A primary issue
is managing the balance between high resolution and frame rate in order to
provide the best interactivity for the user (chapter 1). The thesis of this dis-
sertation is that multiresolution can manage this balance effectively, resulting
in higher levels of user interactivity than are possible with systems that do
not exploit this feature (chapter 2).
The multiresolution features of the Metabuffer support adjusting the
tradeoff between resolution and frame rate in a dynamic manner by allowing
varying levels of detail and resolution in the same image (chapter 3). Machines
in a Metabuffer equipped rendering cluster can send their output imagery to
anywhere within the entire multitiled display space in the form of a viewport.
These viewports may overlap and can be of any resolution multiple.
To demonstrate that the architecture is viable, a simulator was written
which emulates the Metabuffer at the level of the bus clock tick (chapter 4).
Running test scenes using the simulator shows that the Metabuffer architecture
is able to generate glitch free output imagery. The bandwidth requirements
for any frame are constant throughout the entire compositing process and
thus do not overload the bus and starve any of the compositing pipelines.
The simulator also shows the capabilities of the antialiasing and transparency
features of the Metabuffer.
For dealing with more complex applications of the Metabuffer, an emu-
lator is demonstrated that mimics the working of the Metabuffer but is geared
to running on its host architecture as fast as possible (chapter 5). This allows
for a level of interactivity not possible with the simulator. The emulator is
currently running on a cluster of Linux machines connected to a 5 by 2 tiled
display wall in the visualization laboratory. However, the use of cross platform
libraries for communication and display needs should allow it to be ported to
almost any platform.
To support the Metabuffer emulator, a partitioning scheme is shown
that divides models into smaller groupings of triangles which can then be sent
to the individual rendering machines on the cluster (chapter 6). The emulator
is also provided with a wireless visualization control device implemented under
Windows CE using wireless Ethernet (chapter 7). The wireless devices allow
the user to have a high level of control over the emulator.
With the emulator in place, applications can be developed for the Meta-
buffer using its simple plugin API. These applications, once developed for the
emulator, will be easy to move to the Metabuffer hardware when it is imple-
mented. Two such applications, which exploit the features of the Metabuffer,
are shown in this dissertation. The multiresolution capabilities of the Metabuffer
allow for a great deal of flexibility in allocating rendering resources to portions
of the screen. This is used in the foveated vision plugin (chapter 9) to concen-
trate the greatest level of detail and the most rendering machines where the
user is focused. The larger peripheral area is rendered in lower resolution and
detail with fewer machines.
The multiresolution features of the Metabuffer also assist greatly in
managing communication needs. This is demonstrated in the progressive im-
age composition plugin (chapter 8). Even in the face of fast changing user
viewpoints, frame rates remain steady. The rendering machines can switch
to lower resolutions instead of being forced to move polygon data or imagery
through the network. When the user stops to study an area, communica-
tion can then take place without penalty to interactivity and provide higher
resolution output.
10.2 Limitations of the Metabuffer
Because the bandwidth requirements of the Metabuffer must be even
throughout the entire rasterization of the display, there are limitations on the
sizes of the viewports that can be used. Viewports can only be of integer
resolution multiples. This is because the bandwidth needs are lessened by using pixel
replication on the composer nodes. If the resolution multiples are not integer
values, pixel replication will not be as effective and this will mean higher
bandwidth needs that in some cases may swamp the bus. Also, if a viewport
is in low resolution, it can only be positioned on a location that is a multiple
of that resolution. Again, this is to assist pixel replication.
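A minimal sketch of the replication itself, for a packed single channel buffer; the real composer operates on RGB and Z data streaming through its queue:

// Expand a viewport rendered at 1/k resolution into k x k blocks of
// display pixels. An integer k keeps the address arithmetic (and the
// bandwidth) cheap; a fractional multiple would break this scheme.
void replicate(const unsigned long *src, int srcW, int srcH,
               unsigned long *dst, int k)   // dst is (srcW*k) x (srcH*k)
{
    for (int y = 0; y < srcH * k; y++)
        for (int x = 0; x < srcW * k; x++)
            dst[y * srcW * k + x] = src[(y / k) * srcW + (x / k)];
}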
Pixel replication in general is another limitation of the Metabuffer.
While fast and simple, it yields blockiness at very low resolutions. To achieve
higher quality low resolution images some form of linear interpolation should be
used to smooth the replicated pixels. This would add greatly to the complexity
of the Metabuffer hardware, but would not be impossible.
The Metabuffer requires non-trivial custom hardware to implement.
In this regard, the SHRIMP project, which uses a standard cluster, and the
Sepia project, which uses COTS components, would be easier to deploy.
However, the Lightning-2 board developed by Intel shares the same basic ar-
chitecture as the Metabuffer. By reprogramming the compositing nodes of this
board to have an on-board cache and do pixel replication it should be possible
to yield the multiresolution features of the Metabuffer. Unfortunately, there
are very few details about the Lightning-2 available to the public.
10.3 Limitations of the Applications
The demonstration applications included in this dissertation are in-
tended to show that the multiresolution features of the Metabuffer help keep
frame rates fast and consistent. In this regard, I feel that they have served
their purpose well, but there are many improvements that could be made to
both the progressive image composition and foveated vision plugins.
For the progressive image composition plugin, a polygon moving method
needs to be fully implemented in order to allow the renderers to generate high
resolution imagery over time. A strategy for this was outlined using a server
to keep track of blocks of polygons but has not been deployed.
In the case of the foveated vision plugin, there is currently no multiuser
support. In order to support multiple users, another set of machines with a
replicated data set would have to be devoted to that user as outlined in the
conclusion of chapter 9. This should not be difficult to implement.
10.4 Future Work
There are many avenues for future work on the Metabuffer project.
These areas can be divided into work on the hardware, the applications, and
the user interface.
For the hardware, the primary goal would be to create an actual pro-
totype. One way to do this would be to obtain the design of the current
Lightning-2 board and reprogram its FPGAs to reflect the operation of the
Metabuffer. Any changes needed to the Lightning-2 layout should be very
minimal, as the two architectures share the same crossbar design. The other
way would be to do an original design with the assistance of an outside vendor.
The Metabuffer hardware would need to be a high speed design in order to
keep up with the output of the graphics cards.
For the applications, the implementation of a polygon redistribution
server is still needed for the progressive image compositing plugin. This server
would direct the rendering machines where and when to send polygon data
between one another in order to create high resolution viewports for a partic-
ular viewpoint. For the foveated vision plugin, as stated previously, additional
work needs to be done in order to support multiple users.
There is a great deal of interest from Sandia National Labs to get the
Metabuffer adapted to use the WireGL, or Chromium, API from Stanford.
WireGL is essentially an extension of OpenGL which distributes polygons
from existing applications seamlessly to a cluster of rendering computers. It
is discussed in chapter 2.
Finally, many exciting uses could be explored with the wireless visual-
ization device. Currently the device only sends data to the cluster, but there
is no reason why the cluster could not transmit information back to the device
in order to give the user more visual cues or other information. There are
certainly many more topics to be dealt with in this area.
10.5 Conclusion
The benefits of multiresolution techniques vary in usefulness. Certainly
for the cases in which image quality is paramount, multiresolution techniques
will not be a valid option. However, for situations in which user interactivity
is an overriding concern, and rendering loads are large because of data set size
or complexity, multiresolution does provide fast, consistent frame rates when
used in the context of a parallel, multidisplay image compositing system such
as the Metabuffer.
Appendix A
Simulator Classes
A.0.1 Class CClock
Public Members
CClock (int numthreads)
    In the constructor the number of threads for the barrier is specified.
~CClock ()
bool HL ()
    Enter high to low clock transition.
bool LH ()
    Enter low to high clock transition.
bool OutputReset ()
    This function initiates a system reset.
bool ReadOutputReset ()
    This function reports if a reset is taking place.

Private Members

1.1 CClock Clock Transitions
1.2 CClock Reset Line
The CClock class emulates the rising and falling edge of the hardware
clock. It uses a barrier in order to synchronize the individual threads from
each component, just as the hardware would be synchronized with the clock
signal.
CClock Clock Transitions (1.1)
Names
int mnumthreads
CBarrier* HLBarrier
CBarrier* LHBarrier
In order to simulate the rising and falling edges of the clock, two barriers
constructed with pthreads primitives are used. One controls the high to low
transition and the other controls the low to high transition. The variable
mnumthreads specifies how many threads are to be blocked at both barriers
and is set in the constructor for CClock.
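A minimal sketch of how these pieces fit together, assuming a CBarrier wrapper over the pthreads barrier primitives with a Wait() method; the real class also carries the reset line described next:

// Each component thread calls HL() and LH() once per cycle; the barriers
// release all threads together, just as a shared clock edge would.
class CClock {
public:
    CClock(int numthreads)
        : mnumthreads(numthreads),
          HLBarrier(new CBarrier(numthreads)),
          LHBarrier(new CBarrier(numthreads)) {}
    bool HL() { HLBarrier->Wait(); return true; }   // falling edge
    bool LH() { LHBarrier->Wait(); return true; }   // rising edge
private:
    int mnumthreads;
    CBarrier *HLBarrier;
    CBarrier *LHBarrier;
};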
CClock Reset Line (1.2)
Names
bool mprevoutputreset
bool moutputreset
The reset line allows the system to be initialized. It is implemented here
because CClock is accessible to all the COutFrames which bubble it up the
pipeline. This isn’t very elegant, but it eliminates the need for another class
or additional code in the component classes. Two variables are used to keep
track of it. The variable moutputreset holds the currently latched state and
mprevoutputreset holds the newly latched state.
A.0.2 Class CComposerPipe
Public Members

CComposerPipe (ulong renderer, ulong display, CInFrameBus *bus, CComposerPipe *prev, CClock *clock)
    Set up the position of this composer in the Metabuffer.
~CComposerPipe ()
bool ReadPipe (ulong *highdata, ulong *lowdata, bool *control)
    This function fetches data from the previous composer in the pipeline.
bool SetPipeReady (bool pipeready)
    This function is called by the following composer to bubble up this composer's pipeready.
bool SetPipeReset (bool pipereset)
    This function is called by the following composer to bubble up this composer's pipereset.
void DoBusIO ()
    Perform housekeeping tasks for monitoring the bus.

Private Members

long mticks
    Stores the number of clock ticks that have occurred.
CComposerQueue* mqueue
    The queue used for pixel replication.

2.1 CComposerPipe Composer Position
2.2 CComposerPipe Pipeline Readable Data
2.3 CComposerPipe Pipeline Writable Data
2.4 CComposerPipe Bus Variables
2.5 CComposerPipe Pipeline Variables
2.6 CComposerPipe Thread Functions
The CComposerPipe class simulates the composers in the pipeline. It
takes data in from the CInFrameBus and, if it is responsible for a pixel in the
display, compares that to data coming down the compositing pipeline from
previous CComposerPipes. A lot of this code implements the operations of
the pipeline. Many of the variables are in pairs to simulate the latching of
data. In the constructor, the CComposerPipe is initialized. It is given which
renderer (row) it is responsible for and which display (column) it is creating
a pipeline to drive. It is given a pointer to the renderer’s CInFrameBus class
in order to grab data off the bus as well as a pointer to the CComposerPipe
above it in order to communicate data on the pipeline. Finally, a pointer to
the global CClock is given for clock transitions.
CComposerPipe Composer Position (2.1)
Names
ulong mrenderer
ulong mdisplay
CInFrameBus* mbus
CComposerPipe* mprev
CClock* mclock
In order to communicate correctly with the other components in a Metabuffer,
it is necessary to know where this instance of the composer has been placed
and how to talk to the other components in the system. These values are
initialized in the constructor. Here the number of the renderer and display are
recorded. Pointers also exist to the previous CComposerPipe in the pipeline
and the CInFrameBus for data exchange. Finally, the CClock is included for
clock transitions.
CComposerPipe Pipeline Readable Data (2.2)
Names
ulong mhighpipe
ulong mlowpipe
bool mcontrol
This data is the latched in data owned by this instance of the CComposer-
Pipe. It consists of the mhighpipe and mlowpipe values which normally store
RGB and Z information, along with an mcontrol bit which specifies if control
information is being sent over highpipe and lowpipe instead.
CComposerPipe Pipeline Writable Data (2.3)
Names
bool mpipeready
bool mprevpipeready
bool mpipereset
bool mprevpipereset
In order to simulate a latch in of the pipeready bit as it is being bubbled up
the pipeline, two values are used. The bit mpipeready is the new value and
mprevpipeready is the currently latched in value. A similar convention is used
for mpipereset and mprevpipereset.
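A minimal sketch of this convention, assuming the copy happens on the clock transition; the helper name is hypothetical:

// Writers set the "new" members during a cycle; readers only ever see
// the previously latched members, which are updated on the clock edge.
void CComposerPipe::LatchPipelineSignals()   // hypothetical helper
{
    mprevpipeready = mpipeready;
    mprevpipereset = mpipereset;
}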
CComposerPipe Bus Variables (2.4)
Names
bool mbusready
    This value is used to tell the bus to abort a send using IRSA.
ulong mbusstate
    This value keeps track of what operation the bus is currently performing.
bool msendingviewports
    State bit to identify when viewports are being transmitted over the bus.
ulong mviewindex
    Value to keep track of viewport copying.
VIEWPORT mNewViewPort
    Data structure used to store the viewport that the composer is responsible for.
Several variables are used in communicating with the bus controlled by the
CInFrameBus instance. The bit mbusready is used by the composer to deter-
mine if an IRSA needs to be sent to the CInFrameBus. The variable mbusstate
keeps track of the current bus operation. The bit msendingviewports keeps
track of whether the bus is currently sending viewports over the bus. During
this period, the viewport that the composer is responsible for may be sent.
The variables mviewindex and mNewViewPort are used to copy the viewport
to the composer’s local memory.
CComposerPipe Pipeline Variables (2.5)
Names
int mstate
ulong mdispcoords
These variables control the operation of the pipeline in the composer. The
variable mstate is the current condition of the pipeline. It tells whether the
composer should be transmitting data, waiting for a pipeready to bubble up,
etc. The variable mdispcoords records the overall location in the display. The
composer checks against this variable to determine if its viewport is currently
within the correct range to send pixels.
CComposerPipe Thread Functions (2.6)
Names
DWORD ThreadProc ()
static void* StaticThreadProc (void *parg)
DWORD StartThread ()
To allow each class its own thread to run in, a few special calls need to be
implemented in C++. StartThread is called from the constructor and creates
the thread. In order for the system to be able to call back into the class, a
static function needs to be defined called StaticThreadProc. StaticThreadProc
takes the class instance as an argument and then calls back into ThreadProc.
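A minimal sketch of the pattern under those names, assuming pthreads and omitting error handling:

// StartThread creates the thread; the static trampoline receives "this"
// and reenters the instance, since pthreads cannot call a nonstatic
// member function directly.
DWORD CComposerPipe::StartThread()
{
    pthread_t tid;
    return pthread_create(&tid, NULL, StaticThreadProc, this);
}

void *CComposerPipe::StaticThreadProc(void *parg)
{
    ((CComposerPipe *) parg)->ThreadProc();   // call back into the class
    return NULL;
}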
A.0.3 Class CComposerQueue
Public Members

CComposerQueue ()
~CComposerQueue ()
bool Get (ulong x, ulong y, ulong *highdata, ulong *lowdata)
    This function provides pixels from the queue if x and y are in the viewport.
bool Put (ulong highdata, ulong lowdata)
    Put the data received from the bus into the queue when it belongs to the composer.
bool BufferIsPrefetched (VIEWPORT *vp)
    This function assigns the new viewport and makes sure the queue is full before starting.
void Reset ()
    Clears out the queue. Called when a reset signal bubbles up the pipeline.

Private Members

ulong* mbuffer
    This is the buffer allocated to hold the FIFO queue.
ulong mbuffstart
    The start of the queue.
ulong mbuffend
    The tail end of the queue. Note that some room is left for old data too!
long mbufflen
    The amount of data stored in the queue.
VIEWPORT mViewPort
    The current viewport that is being worked with.
VIEWPROGRESS mViewProgress
    How much progress has been made with the current viewport.
bool BufferFull ()
    If bufflen is greater than the size of buffer this is TRUE.
bool AllDataFetched (VIEWPORT *vp)
    If the buffer is full entirely with data from the current viewport this is TRUE.
The CComposerQueue class is a special version of a FIFO queue. Es-
sentially it acts like a normal queue except for one important distinction. The
data elements of the queue can be accessed (but not removed) from the queue
at any time. This allows the CComposerPipe classes to do pixel replication.
The queue buffers data coming into the CComposerPipe so that multiple data
accesses for multiresolution are not a problem. It also saves at least one line
of previous imagery so that replication can be done by accessing those old
members.
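A minimal sketch of the peekable queue idea as a ring buffer; the layout is illustrative, and the real class additionally tracks viewport progress:

// Entries can be read by position without being removed, so a composer
// can fetch the same low resolution pixel repeatedly for replication
// and still reach the previous line of imagery.
#include <vector>

class PeekQueue {
public:
    PeekQueue(size_t cap) : buf(cap), head(0), len(0) {}
    bool Put(unsigned long v) {
        if (len == buf.size()) return false;        // queue full
        buf[(head + len++) % buf.size()] = v;
        return true;
    }
    bool Peek(size_t i, unsigned long *v) const {   // read, don't remove
        if (i >= len) return false;
        *v = buf[(head + i) % buf.size()];
        return true;
    }
    void Pop() { if (len) { head = (head + 1) % buf.size(); len--; } }
private:
    std::vector<unsigned long> buf;
    size_t head, len;
};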
A.0.4 Class CInFrameBus
Public Members

CInFrameBus (int renderer, CClock *clock)
    Set up the position of this CInFrameBus in the Metabuffer.
~CInFrameBus ()
bool LoadFrame (char *szTIF, char *szZ, ulong NumViewports, VIEWPORT *ViewportArray, BOOL bShowViewport, int count)
    This function reads in new imagery from the disk.
bool LoadViewsWithoutFrame (ulong NumViewports, VIEWPORT *ViewportArray, int count)
    This function is mainly used for testing viewport locations.
bool ReadBus (ulong *highdata, ulong *lowdata, bool *control)
    Called by the composers to fetch data from the bus.
bool SetBusReady (bool busready)
    Called by the composers to pull down the busready bit.
bool SetBusReset (bool busreset)
    Called by the composers to pull down the busreset bit.
bool SetMastersSynced (bool masterssynced)
    Called by the composers to pull down the masterssynced bit.
bool GetMastersSynched ()
    If no composer has pulled it down, things are synced!

Private Members

long mticks
    Stores the number of clock ticks that have occurred.

4.1 Viewport Information
4.2 CInFrameBus Bus Variables
4.3 CInFrameBus Position
4.4 CInFrameBus Double Buffering
4.5 CInFrameBus Bus Readable Data
4.6 CInFrameBus Bus Writable Data
4.7 CInFrameBus Thread Functions
The CInFrameBus represents the graphics card sending data to the
double buffered viewport which then transmits it to the composers over the
bus. The constructor for this class specifies the renderer that it is responsible
for and also gives a pointer to the global clock for clock transitions.
Viewport Information (4.1)
Names
VIEWPORT mViewPortArray[10]
VIEWPROGRESS mViewProgressArray[10]
VIEWPROGRESS mViewProgress1
ulong mviewportindex1
ulong mnumviewports
ulong mviewportindex
These variables are used to store viewport information recorded from the en-
coding on the imagery, as well as record the progress of data sent to the
composers for each viewport. The variable mViewPortArray stores the actual
viewport for each display. The variable mViewProgressArray shows how each
viewport has been serviced. The variables mViewProgress1, mViewProgress2,
mviewportindex1, and mviewportindex2 are implemented as roll back mech-
anisms when an IRSA event occurs. A few pixels will be dropped in these
cases, so it is necessary to always store the state of the last two operations.
CInFrameBus Bus Variables (4.2)
Names
int mstate
bool mgoaheadandsend
ulong msendindex
ulong msendlength
In order to keep track of the operations of the bus, a few variables are needed
to store state. The variable mstate tells what operation the bus is currently in.
The bit mgoaheadandsend means that the frame buffer has been loaded and
swapped for the next image send. The variables msendindex and msendlength
are both used to assist in transmitting viewport structures.
CInFrameBus Position (4.3)
Names
int mrenderer
CClock* mclock
In order to communicate correctly with the other components in a Metabuffer,
it is necessary to know where this instance of the composer has been placed.
These values are initialized in the constructor. Here the number of the renderer
is recorded. The CClock is included for clock transitions.
CInFrameBus Double Buffering (4.4)
Names
unsigned char* mbuff1
unsigned char* mbuff2
unsigned char* minbuff
unsigned char* moutbuff
unsigned char* mzbuff1
unsigned char* mzbuff2
unsigned char* minzbuff
unsigned char* moutzbuff
CMutex* DoubleBuffMutex
ulong mframecount
One of the main jobs of the CInFrameBus is to double buffer the input imagery
from the graphics cards. The composers require that the input imagery be
accessed in a random fashion. Since DVI only provides the data in raster line
order, a full screen must be double buffered. These variables achieve that.
Note that Lightning-2 avoids this by rearranging the screen on the graphics
card for the proper ordering. This results in a loss of throughput, but allows
for much simpler hardware and may actually be the best way to implement
this.
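A minimal sketch of the swap under the member names above; the helper name and the Lock()/Unlock() calls on CMutex are assumptions:

// Exchange the in and out pointers under the mutex so the loader thread
// and the bus thread never see a half swapped pair of buffers.
void CInFrameBus::SwapInputBuffers()   // hypothetical helper
{
    DoubleBuffMutex->Lock();
    unsigned char *t;
    t = minbuff;  minbuff  = moutbuff;  moutbuff  = t;   // imagery
    t = minzbuff; minzbuff = moutzbuff; moutzbuff = t;   // Z data
    mframecount++;
    DoubleBuffMutex->Unlock();
}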
CInFrameBus Bus Readable Data (4.5)
Names
ulong mhighbus
ulong mlowbus
bool mcontrol
These values are placed on the bus by this instance of CInFrameBus for the
other composers on the bus to read. The variables mhighbus and mlowbus are
typically used to transmit RGB and Z information, although the mcontrol bit
can specify that control information is being passed instead.
CInFrameBus Bus Writable Data (4.6)
Names
bool mbusready
bool mprevbusready
bool mbusreset
bool mprevbusreset
bool mmasterssynced
bool mprevmasterssynced
In order to simulate a pulldown line on the bus, pairs of variables are used
for mbusready, mbusreset, and mmasterssynced. Each pair consists of a new
value as a result of a pulldown, and a currently latched value.
CInFrameBus Thread Functions (4.7)
Names
DWORD ThreadProc ()
static void* StaticThreadProc (void *parg)
DWORD StartThread ()
To allow each class its own thread to run in, a few special calls need to be im-
plemented in C++. StartThread is called from the constructor and creates the
thread. For the system to be able to call back into the class, a static function
needs to be defined called StaticThreadProc. StaticThreadProc takes the class
instance as an argument and then calls back into ThreadProc.
A.0.5 Class COutFrame
Public Members

COutFrame (int display, CComposerPipe *prev, CClock *clock)
    Initialize the position of the frame buffer.
~COutFrame ()
bool SaveImage (char *szTIF)
    This function saves the frame buffer to a TIF image.

Private Members

long mticks
    Stores the number of clock ticks that have occurred.

5.1 COutFrame Position
5.2 COutFrame Frame Buffer Variables
5.3 COutFrame Pipeline Variables
5.4 COutFrame TIF Variables
5.5 COutFrame Queue Variables
5.6 COutFrame Thread Functions
The COutFrame class simulates the output frame buffer. At the end
of each compositing pipeline, it is responsible for gathering the composited
imagery, down-sampling by averaging the 4 neighboring pixels (for supersam-
pling), and then displaying it on the tiled monitors. In the constructor COut-
Frame is given which display it is, the pointer to the CComposerPipe directly
above it (so that it can read data from the pipeline and send data back up),
and the global clock (so that it can enter into clock transitions and read the
reset line).
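A minimal sketch of the averaging step for a single channel; the real frame buffer interleaves RGB and receives pixels in raster order through the queue described below:

// Average the 2x2 block of supersampled pixels at (2x, 2y) into one
// output pixel; w is the width of the supersampled image.
unsigned char downSample(const unsigned char *img, int w, int x, int y)
{
    int sum = img[(2 * y) * w + (2 * x)]
            + img[(2 * y) * w + (2 * x + 1)]
            + img[(2 * y + 1) * w + (2 * x)]
            + img[(2 * y + 1) * w + (2 * x + 1)];
    return (unsigned char)(sum / 4);
}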
COutFrame Position (5.1)
Names
int mdisplay
    Display number as specified by constructor.
CComposerPipe* mprev
    CComposerPipe directly above frame buffer for pipeline reads and writes.
CClock* mclock
    Global clock for clock transitions and RESET line information.
In order to communicate correctly with the other components in a Metabuffer,
it is necessary to know where this instance of the COutFrame has been placed
and how to talk to the other components in the system. These values are
initialized in the constructor. Here the number of the display is recorded.
Pointers also exist to the previous CComposerPipe in the pipeline for data
exchange. Finally, the CClock is included for clock transitions.
COutFrame Frame Buffer Variables (5.2)
Names
unsigned char* mbuff
ulong mbuffindex
All of that data has to go somewhere! The buffer mbuff stores the output
image in the frame buffer. The variable mbuffindex is the current index into
the frame buffer as the data comes out in raster line order.
COutFrame Pipeline Variables (5.3)
Names
bool mpipeready
    The value of the PIPEREADY signal on the pipeline.
bool mpipereset
    The value of the PIPERESET signal the frame buffer will bubble up the pipeline.
bool mwaitforsentinel
    Keeps track of when a frame has been finished but the next hasn't started.
Several variables are used in communicating with the pipeline controlled by
the CComposerPipe instance. The bit mpipeready is bubbled up the pipeline
when the frame buffer is ready for more data. Likewise, mpipereset is bubbled
up the pipeline when a reset has been detected from the CClock instance.
The bit mwaitforsentinel keeps track of when a pipeready has been sent up
the pipeline but an acknowledgement that the composers are synced and ready
to send hasn’t been received.
COutFrame TIF Variables (5.4)
Names
CMutex* SaveMutex
int mpicindex
The libtiff library isn't thread safe, so this CMutex is used to guard against
corrupting any of its internal data structures. The variable mpicindex is the
index used to mark the name of the TIF file to save.
COutFrame Queue Variables (5.5)
Names
unsigned char* mqueue
ulong mqueueindex
In order to down-sample the four neighboring pixels into one supersampled
pixel, it is necessary to store the previous line of pixels. These variables form
a queue that always has the last line of pixels in memory.
COutFrame Thread Functions (5.6)
Names
DWORD ThreadProc ()
static void* StaticThreadProc (void *parg)
DWORD StartThread ()
To allow each class its own thread to run in, a few special calls need to be
implemented in C++. StartThread is called from the constructor and creates
the thread. For the system to be able to call back into the class, a static func-
tion needs to be defined called StaticThreadProc. StaticThreadProc takes the
class instance as an argument and then calls back into ThreadProc.
Appendix B
Emulator Distribution
B.1 Contents
There are two main tar files in this distribution:
• meta.tar.gz
• metadata.tar.gz
Meta.tar.gz contains a slightly modified version of the GLUT library,
the TIFF library, the OCview library, the Metabuffer emulator code, a plugin
directory with several example plugins for the Metabuffer emulator, and a
tools directory containing three utilities: metaload.c, which splits large data
sets into pieces using the greedy viewport allocation algorithm for the
progressive image composition plugin; metascatter.c, which divides a data set
in a modulo manner suitable for the foveated vision plugin; and metapaste.c,
which pieces the tiled display output images back together to form movies.
The initial plugin.cpp file in the Metabuffer emulator code (teapot.cpp)
does not rely on the metadata.tar.gz contents. To try either progressive.cpp
(progressive image composition), fovea.cpp (foveated vision), or the simpler
ducksetal.cpp (a couple OCview models bouncing around the display), the
metadata file is needed.
B.2 Building the Metabuffer Emulator
In order to create the Metabuffer emulator it is necessary to follow this
build process. Because some of the libraries have dependencies on the others,
build the libraries in this order.
B.2.1 glut-3.7
This is a slightly modified version of the GLUT library. The main
changes here are the addition of a call back function, glutMainLoopUpdate(),
in glut event.c to process GLUT commands in the single threaded MPICH
processes instead of resorting to the glutMainLoop() endless loop. The code
is shown below in case an updated version of GLUT needs to be modified.
void APIENTRY
glutMainLoop(void)
{
  for (;;)
    glutMainLoopUpdate();
}

/* CENTRY */
void APIENTRY
glutMainLoopUpdate(void)
{
#if !defined(_WIN32)
  if (!__glutDisplay)
    __glutFatalUsage("main loop entered with out proper initialization.");
#endif
  if (!__glutWindowListSize)
    __glutFatalUsage("main loop entered with no windows created.");
  {
    if (__glutWindowWorkList) {
      GLUTwindow *remainder, *work;

      work = __glutWindowWorkList;
      __glutWindowWorkList = NULL;
      if (work) {
        remainder = processWindowWorkList(work);
        if (remainder) {
          *beforeEnd = __glutWindowWorkList;
          __glutWindowWorkList = remainder;
        }
      }
    }
    if (__glutIdleFunc || __glutWindowWorkList) {
      idleWait();
    } else {
      if (__glutTimerList) {
        waitForSomething();
      } else {
        processEventsAndTimeouts();
      }
    }
  }
}
/* ENDCENTRY */
The function glutMainLoopUpdate() is used instead of the standard
GLUT message loop that is called at the very end of most GLUT programs.
Essentially it includes all the glutMainLoop() code except for the endless for
loop. The glutMainLoop() function remains for completeness. This addition
was done because the version of MPICH used on the Prism cluster does not
support multiple threads in a process. Therefore, the glutMainLoopUpdate()
function is called back periodically by the main thread to process the GLUT
messages.
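As a purely illustrative sketch, a single threaded emulator process might then interleave its own work with event processing as follows:

/* Illustrative only: the emulator's actual main loop is not shown here. */
for (;;) {
    do_emulator_work();      /* hypothetical per-iteration MPI/compositing work */
    glutMainLoopUpdate();    /* process pending GLUT events, then return */
}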
The makefile here is modified and configured to generate a static library
instead of the shared library that GLUT normally would create. Because of
the GLUT modification above, this version probably should not be installed
as the system version of GLUT. This way only the Metabuffer emulator will
be linked to it.
To create the GLUT library, go into the glut-3.7/lib/glut directory and
type:
make
This will generate the libglutmpi.a file that is the static library the
Metabuffer emulator will link against.
B.2.2 tiff-v3.5.5
This is an unmodified standard distribution of the TIFF image library.
It is used to save output from the Metabuffer emulator for remote debugging,
to generate movies, or to make images for papers or reports.
To create the TIFF library, go into the tiff-v3.5.5/libtiff directory and
type:
make
mv libtiff.a libtiffz.a
This will generate the libtiff.a file that is the static library the Meta-
buffer will link against. Rename this file libtiffz.a in order for the Metabuffer
makefile to work with it (for some reason the Prism machines were aliasing
this with another tiff library).
B.2.3 ocview
OCview is an out of core renderer developed at the University of Texas
at Austin. Currently Xiaoyu Zhang maintains it. OCview allows images to
be generated from data that can be much larger than the amount of memory
in the system by fetching that data from secondary storage. For most runs,
usually the data is kept small enough to just fit within system memory. Still,
this capability exists for even larger data sets when there are not enough
machines available for splitting the data set.
To create the OCview library, go into the ocview directory and type:
make
B.2.4 emu
After the previous three libraries have been built, it is now time to
build the actual emulator. First, it is necessary to tell the emulator code how
the system is laid out. Go into the emu directory and edit the enviro.h file;
a sample configuration is sketched after the list below.
• Set DISPX and DISPY to the resolution of the rendering and display
machines. At UT, this is 800 by 600.
• Set NUMOUTX and NUMOUTY to the tiling configuration of the pro-
jectors. The UT visualization lab has a 5 by 2 tiled display wall.
• Set NUMINPUTS to the number of rendering machines that are being
used. All the plugins in this distribution rely on 10 rendering machines,
though they should work with varying tile configurations. If 10 machines
aren’t available, a few changes to the plugin.cpp code might be needed.
• Set szBindings to the names (gethostname()) of the machines that drive
the display. Order these names left to right, top to bottom according to
how they are laid out in the tiled display wall.
• Set szHomeDir to the directory that contains the meta and metadata
directories. This is used for the progressive.cpp, fovea.cpp, and duckse-
tal.cpp plugins in order to find their data sets. In the distribution it is
in the ~wjb home directory.
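As a hypothetical example, an enviro.h for the UT configuration described above might read as follows; the host names and home directory are placeholders:

/* Sample enviro.h; host names and directory are placeholders. */
#define DISPX     800                 /* rendering/display resolution */
#define DISPY     600
#define NUMOUTX   5                   /* 5 by 2 tiled display wall    */
#define NUMOUTY   2
#define NUMINPUTS 10                  /* number of rendering machines */

static char *szBindings[NUMOUTX * NUMOUTY] = {
    "wall00", "wall01", "wall02", "wall03", "wall04",
    "wall10", "wall11", "wall12", "wall13", "wall14"
};
static char szHomeDir[] = "/home/wjb";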
After that is done type:
make
This will create the Metabuffer emulator (meta.exe) and link it to the
previous three libraries.
B.3 Running the Metabuffer Emulator
In order to run the Metabuffer emulator at UT, type:
mpirun.mpich -arch PROJ -np 10 \
-arch NONPROJ -np 10 \
/home/wjb/meta/emu/meta
On the UT cluster, the mpirun command is named mpirun.mpich. Oth-
ers may be different. The -arch commands specify a machine list file (on the
Prism cluster located in /usr/lib/mpich/util/machines). machines.PROJ con-
tains a list of the 10 machines hooked up to the projectors. -np 10 specifies
to use all 10 of them (obviously!). Similarly machines.NONPROJ specifies all
the machines that aren’t connected to displays, 22 in all. We need only 10 of
those. This just forces MPI to use all the projector machines and then select
from the rest, so set these things to the display wall configuration. MPI really
wants a full path to the executable, so replace /home/wjb/meta/emu/meta
with wherever the meta.exe file happens to reside.
Bibliography
[1] Assarsson, U., and Moller, T. Optimized view frustum culling
algorithms. Tech. rep., Chalmers University of Technology, March 2000.
[2] Bajaj, C. L., Pascucci, V., Rabbiolo, G., and Schikore, D. R.
Hypervolume visualization: A challenge in simplicity. In IEEE Sympo-
sium on Volume Visualization (1998), pp. 95–102.
[3] Blanke, W. Multiresolution Techniques on a Parallel Multidisplay Mul-
tiresolution Image Compositing System. PhD thesis, University of Texas
at Austin, 2001.
[4] Blanke, W., Bajaj, C., Fussell, D., and Zhang, X. The meta-
buffer: A scalable multiresolution multidisplay 3-d graphics system using
commodity rendering engines. Tr2000-16, University of Texas at Austin,
February 2000.
[5] Blanke, W., Bajaj, C., Zhang, X., and Fussell, D. A cluster
based emulator for multidisplay, multiresolution parallel image composit-
ing. Tech. rep., University of Texas at Austin, April 2001.
[6] Bunker, M., and Economy, R. Evolution of GE CIG systems. SCSD
Document (1989).
[7] Cruz-Neira, C., Sandin, D. J., and DeFanti, T. A. Virtual reality: The
design and implementation of the CAVE. Computer Graphics 27, 4 (August
1993), 135–142.
[8] Chen, Y., Clark, D., Finkelstein, A., Housel, T., and Li, K.
Automatic alignment of high resolution multi-projector displays using an
un-calibrated camera. In Proceedings of IEEE Visualization Conference
(2000), pp. 125–130.
[9] Coren, S., Ward, L., and Enns, J. Sensation & Perception. Har-
court Brace, New York, NY, 1999.
[10] Crockett, T. W. Parallel rendering. Tech. rep., ICASE, 1995.
[11] Eldridge, M., Igehy, H., and Hanrahan, P. Pomegranate: A fully
scalable graphics architecture. Computer Graphics (SIGGRAPH 2000
Proceedings) (2000), 443–454.
[12] Eyles, J., Molnar, S., Poulton, J., Greer, T., Lastra, A.,
England, N., and Westover, L. Pixelflow: The realization. In Pro-
ceedings of the Siggraph/Eurographics Workshop on Graphics Hardware
(August 1997), pp. 57–68.
[13] Ferrari, F., Nielsen, J., Questa, P., and Sandini, G. Space
variant imaging. Sensor Review 15, 2 (1995), 17–20.
[14] Fitzmaurice, G. Situated information spaces and spatially aware palm-
top computers. Communications of the ACM 36, 7 (July 1993).
[15] Foley, J., van Dam, A., Feiner, S., and Hughes, J. Computer
Graphics: Principles and Practice. Addison-Wesley Publishing Com-
pany, Reading, MA, 1990.
[16] Forrest, A. R. Antialiasing in progress. Fundamental Algorithms for
Computer Graphics 17 (1985), 113–134.
[17] Fussell, D. S., and Rathi, B. D. A vlsi-oriented architecture for
real-time raster display of shaded polygons. In Graphics Interface ’82
(May 1982).
[18] Gandhi, R., Khuller, S., and Srinivasan, A. Approximation al-
gorithms for partial covering problems. In Proceedings of ICALP 2001
(July 2001).
[19] Geisler, W., and Perry, J. Variable-resolution displays for visual
communication and simulation. The Society for Information Display 30
(1999), 420–423.
[20] Hanrahan, P. Scalable graphics using commodity graphics systems.
Views pi meeting, Stanford Computer Graphics Laboratory, Stanford Uni-
versity, May 17, 2000.
[21] Heirich, A., and Moll, L. Scalable distributed visualization using off-
the-shelf components. In Parallel Visualization and Graphics Symposium
– 1999 (San Francisco, California, October 1999), J. Ahrens, A. Chalmers,
and H.-W. Shen, Eds.
[22] Hochbaum, D. Approximation Algorithms for NP-Hard Problems. PWS
Publishing Company, Boston, MA, July 1996.
[23] Hoppe, H. Smooth view-dependent level-of-detail control and its appli-
cation to terrain rendering. In IEEE Visualization 1998 (October 1998),
pp. 35–42.
[24] Humphreys, G., and Hanrahan, P. A distributed graphics system
for large tiled displays. In Proceedings of IEEE Visualization Conference
(1999), pp. 215–223.
[25] id software. Quake. http://www.quake.com.
[26] Johnson, R. Pthreads-win32. http://sources.redhat.com/pthreads-
win32/.
[27] Kettler, K. A., Lehoczky, J. P., and Strosnider, J. K. Mod-
eling bus scheduling policies for real-time systems. In Proceedings of
16th IEEE Real-Time System Symposium (1995), IEEE Computer Soci-
ety Press, pp. 242–253.
[28] Kilgard, M. Glut. http://reality.sgi.com/opengl/glut3/.
[29] Lamming, M., Brown, P., Carter, K., Eldridge, M., Flynn, M.,
Louie, G., Robinson, P., and Sellen, A. The design of a human
memory prosthesis. The Computer Journal 37, 3 (1994).
[30] Leffler, S. Libtiff. http://www.libtiff.org/.
[31] Lombeyda, S., Moll, L., Shand, M., Breen, D., and Heirich, A.
Scalable interactive volume rendering using off-the-shelf components. In
Proceedings of IEEE 2001 Symposium on Parallel and Large-Data Visualization
and Graphics (2001), IEEE Computer Society Press, pp. 115–
121.
[32] Magillo, P., Floriani, L. D., and Puppo, E. A dimension and
application-independent library for multiresolution geometric modeling.
Tech. Rep. DISI-TR-00-11, University of Genova, Italy, 2000.
[33] Majumder, A., He, Z., Towles, H., and Welch, G. Achieving
color uniformity across multiprojector displays. In Proceedings of IEEE
Visualization Conference (2000), pp. 117–124.
[34] Mammen, A. Transparency and antialiasing algorithms implemented
with the virtual pixel maps technique. IEEE Computer Graphics and
Applications 9, 4 (July 1989), 43–55.
[35] Microsoft. Windows ce embedded visual tools. http://www.microsoft.com/
mobile/downloads/emvt30.asp.
[36] Molnar, S., Cox, M., Ellsworth, D., and Fuchs, H. A sort-
ing classification of parallel rendering. IEEE Computer Graphics and
Applications 14, 4 (July 1994).
[37] Molnar, S. E. Combining z-buffer engines for higher-speed rendering.
In Proceedings of the 1988 Eurographics Workshop on Graphics Hardware
(1988), Eurographics Seminars, pp. 171–182.
[38] Molnar, S. E. Image composition architectures for real-time image
generation. PhD dissertation, Technical Report TR91-046, University of
North Carolina, 1991.
[39] Moreland, K., Wylie, B., and Pavlakos, C. Sort-last parallel ren-
dering for viewing extremely large data sets on tile displays. In Proceed-
ings of IEEE 2001 Symposium on Parallel and Large-Data Visualization
and Graphics (2001), IEEE Computer Society Press, pp. 85–92.
[40] Muraki, S., Ogata, M., Ma, K.-L., Koshizuka, K., Kajihara,
K., Liu, X., Nagano, Y., and Shimokawa, K. Next-generation
visual supercomputing using pc clusters with volume graphics hardware
devices. In Supercomputing 2001 (2001).
[41] Pardo, F., and Martinuzzi, E. Hardware environment for a retinal
ccd visual sensor. In EU-HCM SMART Workshop: Semi-autonomous
Monitoring and Robotics Technologies (April 1994).
[42] Raskar, R., Brown, M., Yang, R., Chen, W., Welch, G., Towles,
H., Seales, B., and Fuchs, H. Multi-projector displays using cam-
era based registration. In Proceedings of IEEE Visualization Conference
(1999), pp. 161–168.
[43] Saito, N., and Beylkin, G. Multiresolution representations using
the auto-correlation functions of compactly supported wavelets. IEEE
Transactions on Signal Processing 41 (December 1993), 3584–3590.
[44] Samanta, R., Zheng, J., Funkhouser, T., Li, K., and Singh,
J. P. Load balancing for multi-projector rendering systems. In SIG-
GRAPH/Eurographics Workshop on Graphics Hardware (August 1999).
[45] Schneider, B.-O. Parallel rendering on pc workstations. In Paral-
lel and Distributed Processing Techniques and Applications (July 1998),
pp. 1281–1288.
[46] SGI. Opengl. http://www.opengl.org.
[47] Shamir, A., Pascucci, V., and Bajaj, C. Multi-resolution dynamic
meshes with arbitrary deformations. Tech. Rep. TICAM 00-07, Univer-
sity of Texas at Austin, March 2000.
[48] Shapiro, J. M. Embedded image coding using zerotrees of wavelet co-
efficients. IEEE Transactions on Signal Processing 41 (December 1993),
3445–3462.
[49] Shaw, C. D., Green, M., and Schaeffer, J. A vlsi architecture for
image composition. In Proceedings of the 1988 Eurographics Workshop
on Graphics Hardware (1988), Eurographics Seminars, pp. 183–199.
[50] Weinberg, R. Parallel processing image synthesis and anti-aliasing.
Computer Graphics 15, 3 (July 1981), 55–61.
[51] Weiser, M. Some computer science issues in ubiquitous computing.
Communications of the ACM 36, 7 (July 1993), 65–84.
[52] Wodnicki, R., Roberts, G., and Levine, M. A foveated image
sensor in standard cmos technology. In Custom Integrated Circuits Con-
ference (1995).
[53] Zhang, X., Bajaj, C., and Blanke, W. Scalable isosurface visu-
alization of massive datasets on COTS clusters. In Proceedings of IEEE
2001 Symposium on Parallel and Large-Data Visualization and Graphics
(2001), IEEE Computer Society Press, pp. 51–58.
Vita
William John Blanke was born in Charlotte, North Carolina on May
21, 1972 to Dianne Kiser Blanke and Robert John Blanke. After graduating
from Charlotte Latin School in 1990, he attended Duke University. During
this time he took summer course work from The University of North Carolina
at Charlotte and interned at the North Carolina Supercomputing Center un-
der a National Science Foundation undergraduate fellowship. He graduated
from Duke in 1994 with a Bachelor of Science in Engineering degree, triple
majoring in electrical engineering, computer science, and history. Afterwards,
he attended The University of Virginia earning a Master of Science degree in
electrical engineering in 1996. Following this, he was employed by PrivNet,
Inc., an Internet startup company which was subsequently bought by PGP,
Inc., a cryptography firm. In 1997, he attended The University of Texas at
San Antonio as a non-degree seeking student. In 1998, he enrolled in The
University of Texas at Austin as a Ph.D. student in computer engineering.
Permanent address: 2932 Houston Branch Road, Charlotte, NC 28270
This dissertation was typeset with LaTeX† by the author.

†LaTeX is a document preparation system developed by Leslie Lamport as a special version of Donald Knuth's TeX program.