Scheduling and Data Management for Parallel Ray Tracing.

Erik Reinhard

A dissertation submitted to the University of Bristol in accordance with the requirements of the degree of Doctor of Philosophy in the Faculty of Engineering, Department of Computer Science.

October 1999



Abstract

Parallelising ray tracing with a data parallel approach allows rendering of arbitrarily large models, but the inherent load imbalances may lead to severe inefficiencies. To compensate for the uneven load distribution, demand-driven tasks may be split off and scheduled to processors that are less busy. We propose a hybrid scheduling algorithm which brings tasks and data together according to coherence between rays. Coherent tasks are scheduled demand driven and the remainder is executed data parallel. This method removes the worst hot-spots from the data parallel component and reschedules those as demand driven tasks, thereby evening out the workload. Processing power, communication and memory are three resources which should be evenly used. Our current implementation is assessed against these requirements. Related issues, such as the distribution of the workload over space and the resulting requirements for the distribution of objects over the processors, are investigated as well. Finally, an assessment is made of the algorithm's ability to deal with complexity in the form of large amounts of geometry and difficult lighting conditions in the form of diffuse inter-reflection calculations.


Author's declaration

I declare that the work in this dissertation was carried out in accordance with the Regulations of the University of Bristol. The work is original except where indicated by special reference in the text. No part of the dissertation has been submitted for any other degree, with the exception of chapter 5. The work in this chapter was initiated by Pim van der Wal, Wim de Leeuw and Frederik W. Jansen [70], then modified for parallel processing [78] and subsequently submitted as part of the work done for my TWAIO certificate [48]. Any views expressed in the dissertation are those of the author and in no way represent those of the University of Bristol. The dissertation has not been presented to any other University for examination either in the United Kingdom or overseas.

SIGNED:

DATE:


Preface

The work described in this thesis is concerned with exploring scheduling techniques in order to allow large scenes to be efficiently rendered. If you think that parallel rendering is now a solved problem, think again. However, this thesis does investigate the problems associated with it and attempts to show possible routes to getting closer to solving them.

The work presented here was started at Delft University of Technology under the supervision of Professor Frederik Jansen. A large number of discussions and arguments with him have laid out the groundwork for this research. Quite invaluable stuff went on there, as you can imagine. Working in the lab was made most pleasant by those who also happened to do a PhD there. I would like to thank Wim de Leeuw, Klaas Jan de Kraker, Maurice Dohmen, Winfried van Holland (that trick with the tray did not work in Bristol: they had carpet on the floor), Ari Sadarjoen, Freek Reinders, Arjan Kok, Theo van Walsum, Rafael Bidarra and Andrea Hin. Part of the work was sponsored by TNO Physics and Electronics Laboratory, who provided an alternative working environment, albeit equally pleasant. Joost van Lawick van Pabst and Arjan Kok (again) provided the counter arguments there as well as a great working environment.

Then my work ended in Holland and was continued/restarted at the University of Bristol. In comes Alan Chalmers as my new supervisor, who managed me very well indeed. I found myself in a very pleasant working environment which for a substantial period of time was located at various workshops and conferences (SIGGRAPH perhaps being the pinnacle). It's rather fantastic to meet the graphics community in all these different places. I should mention the Irish, the French, the Americans, the Belgians, the Germans, the Austrians and the Spanish, to name just a few. This would not have happened without Alan. At work (which happened occasionally here at Bristol), Frederik Jansen kept playing a very important role.

Leaving home and making new friends proved very easy. It was facilitated by another large contingent of people: Chris Setchell (known to fall off his chair at regular intervals – strange), Jackson Pope, Nigel Jewell, Shane Dickson, Matt Wood, Costas Veropoulos (our software installation expert), Zuki Jakavula, Katerina Mania, Claire Kennedy and Mike "Mumble" Evans are just a few who provided inspiration (and laughs). Didn't I mention David Gibson, John Napier, Angus Clark and Mark Everingham? Well, that was because they lived in a different lab. They did, however, live in the same pubs. So did many others who shall not be named to protect the innocent. Both Matthieu Nicolas and MaReK Krejpsky became friends in the rather short time that they were here. The same is true for Phillipa Wrathall, who definitely did not stay with us long enough.

Finally, I should not forget to mention my parents, who have been out of sight for the last three years, but certainly not out of heart. I have now mentioned my parents. Oh, and Greg Larson, who helped me better understand his rather magnificent Radiance software. Cheers, mate!

Oh yes, work was sponsored by various institutes, including the European Commission under TMR grant number ERBFMBICT960655, TNO Physics and Electronics Laboratory, and the Stichting Nationale Computerfaciliteiten (National Computing Facilities) for the use of supercomputer facilities, with financial support from the Nederlandse Organisatie voor Wetenschappelijk Onderzoek (Netherlands Organisation for Scientific Research). Smashing.

Contents

1 Introduction  1
  1.1 Rendering  1
  1.2 Ray tracing  2
  1.3 Parallel ray tracing  6
  1.4 General remarks  8

2 Previous work  9
  2.1 Demand driven ray tracing  9
  2.2 Data parallel ray tracing  13
  2.3 Hybrid scheduling  15

3 Scene analysis  18
  3.1 Distribution of data accesses  18
  3.2 Temporal characteristics  21
  3.3 Temporal behaviour per ray type  24
  3.4 Conclusions  25

4 Hybrid scheduling  28
  4.1 Coherence and data pre-selection  29
  4.2 Data parallel component  31
  4.3 Demand driven component  32
  4.4 Priority task selection  33
  4.5 Data structures  34
  4.6 Example  36

5 Data selection  39
  5.1 Bounding pyramid  39
  5.2 Pyramid-octree intersection  41
  5.3 Ray traversal  45
  5.4 Experiments  45
  5.5 Conclusions  48

6 Static data distributions  49
  6.1 Previous work  50
  6.2 Overview method  51
  6.3 Cost functions  52
    6.3.1 Primary ray distribution  53
    6.3.2 Secondary and higher order ray distributions  54
    6.3.3 Intersection tests  54
    6.3.4 Time complexity  55
  6.4 Data distributions  56
  6.5 Implementation  57
  6.6 Experiments  59
    6.6.1 Cost function  59
    6.6.2 Data distribution algorithm  64
  6.7 Conclusions  67

7 Task scheduling  69
  7.1 Shadow task scheduling  69
  7.2 Caching using octrees  71
  7.3 Data fetching  72
  7.4 Experiments  75
    7.4.1 Performance  75
    7.4.2 Memory consumption  79
    7.4.3 Pre-fetching for all shadow tasks  80
    7.4.4 Pre-fetching for a limited number of tasks  83
    7.4.5 Image size  87
    7.4.6 Scene size  88
  7.5 Conclusions  94

8 Data parallel subsystem  96
  8.1 Diffuse inter-reflection  96
  8.2 Experiments  97
    8.2.1 Ambient calculations  97
    8.2.2 Ambient cache  101
    8.2.3 Work-flow control  102
    8.2.4 Scheduling methods  107
  8.3 Conclusions  108

9 Conclusions and further research  110

Bibliography  113

A Notation and terminology  121
B Radiance internals  126
C Hardware environments  129
D Scene Description  132

List of Tables

4.1 Ray types and their characteristics.  31
4.2 Relative priorities for tasks.  34
6.1 Octree statistics.  59
6.2 Cost estimation for primary rays.  62
6.3 Cost estimation for primary and secondary rays.  62
6.4 Raw data for cost estimation.  63
7.1 Maximum amount of dynamically allocated memory.  80
7.2 Scene sizes and efficiency for the colour cube models.  88
7.3 Scene sizes and efficiency for the 2D sinc models.  90
7.4 Memory used for object storage.  94
A.1 Representative luminous flux densities.  123
D.1 Scene statistics (studio, conference room and colour cube).  132
D.2 Scene statistics (cube models).  133
D.3 Scene statistics (sinc models).  133
D.4 Objects per processor for all models.  133

List of Figures

1.1 Rendering equation.  2
1.2 Overview of ray tracing.  3
1.3 Ray tracing example.  4
1.4 Spatial subdivision techniques.  5
2.1 Image complexity.  10
2.2 Demand driven ray tracing.  11
2.3 Image space subdivision.  12
2.4 Ray coherence.  12
2.5 Data parallel ray tracing.  13
2.6 Data distribution example.  14
3.1 Test scenes: colour cube, studio and conference room.  19
3.2 Data accesses sorted by frequency.  20
3.3 Data accesses per ray type (colour cube).  21
3.4 Data accesses per ray type (transparent colour cube).  21
3.5 Data accesses per ray type (cube model).  22
3.6 Data accesses per ray type (conference room).  22
3.7 Data accesses per second (colour cube).  23
3.8 Data accesses per second (colour cube; including diffuse inter-reflection).  23
3.9 Data accesses per second (transparent colour cube).  24
3.10 Data accesses per second (transparent colour cube; including diffuse inter-reflection).  24
3.11 Data accesses per second (studio model).  25
3.12 Data accesses per second (studio model; including diffuse inter-reflection).  25
3.13 Data accesses per second for primary and shadow rays (colour cube).  27
3.14 Data accesses per second for primary, shadow and ambient rays (studio model).  27
4.1 Internal data structure.  35
4.2 Example data distribution.  36
4.3 Example primary ray task assignment.  37
4.4 Example shadow task assignment.  38
4.5 Example data parallel ray execution.  38
5.1 Deriving the plane normals for pyramid clipping.  40
5.2 Constructing a pyramid around a light source.  41
5.3 Partitioning space into nine subspaces.  41
5.4 Simplified clipping test.  42
5.5 Testing an edge against four planes.  43
5.6 Incrementally computed distances between vertices and planes.  43
5.7 Cliplist ordering.  44
5.8 Modified clipping test.  44
5.9 Ray traversal using a previously generated cliplist.  45
5.10 Pyramid clipping results.  46
5.11 Pyramid clipping results (smaller viewing angle).  47
6.1 Reflection distribution for a voxel.  53
6.2 Primary ray insertion into the estimation process.  53
6.3 Discretising a sphere.  54
6.4 Cost estimation for leaf nodes.  55
6.5 Example of splitting algorithm.  58
6.6 Test scenes used for data distribution algorithm.  60
6.7 Test results for cost function.  61
6.8 Error source for primary ray estimation.  62
6.9 Cost per processor (counting objects).  65
6.10 Cost per processor (using cost estimation algorithm).  65
6.11 Objects per processor (using object counting).  66
6.12 Objects per processor (using cost estimation).  66
6.13 Cost and object distribution (cubes model).  67
7.1 Object to task linking (preserving execution order).  73
7.2 Object to task linking (dependent on data availability).  73
7.3 Task list example.  74
7.4 Task list organisation.  75
7.5 Speed-up for conference and studio models.  76
7.6 Overhead per processor.  77
7.7 Number of idle processors per second.  77
7.8 Communication related overhead per second.  77
7.9 Colour cube C and Sinc C.  78
7.10 Speed-up for colour cube and sinc models.  79
7.11 Dynamic memory allocation.  80
7.12 Rendering time for different cache sizes.  81
7.13 Ray task messages for different cache sizes.  82
7.14 Efficiency per processor for different cache sizes.  83
7.15 Task communication over time for processor 0 (studio model, cache size 400 kB).  84
7.16 Number of processors idle per time unit (studio model, cache size 400 kB).  84
7.17 Efficiency per second.  84
7.18 Rendering time for different scheduling techniques.  85
7.19 Cache sizes for different pre-fetch horizons.  86
7.20 Efficiency per processor for different pre-fetch horizons.  86
7.21 Efficiency and cache size for different image sizes.  88
7.22 Colour cube test scenes.  89
7.23 Sinc function test scenes.  90
7.24 Overhead as function of scene size.  91
7.25 Cache size as function of scene size (colour cube).  92
7.26 Memory usage as function of scene size (colour cube).  92
7.27 Cache size as function of scene size (sinc model).  93
7.28 Memory usage as function of scene size (sinc model).  93
8.1 Total number of data parallel rays.  98
8.2 Total number of task migrations.  99
8.3 Percentage of migrated ambient rays.  99
8.4 Maximum number of stored intersection points.  100
8.5 Ambient cache efficiency.  102
8.6 Render times as function of sub-image size.  104
8.7 Ambient cache efficiency as function of sub-image size.  104
8.8 Intersection point storage.  105
8.9 Dynamic memory allocation per processor.  105
8.10 Intersection point storage per processor.  106
8.11 Intersection point storage per second.  106
8.12 Number of received data parallel rays.  107
8.13 Rendering time for different scheduling techniques (including ambient sampling).  108
8.14 Intersection points stored for different scheduling techniques.  108
A.1 CIE photometric curve.  122
B.1 Octree implementation.  128
C.1 Parsytec CC configuration.  130

1 Introduction

Rendering artificial scenes in a physically correct manner requires a computationally very expensive lighting simulation. In practice, and using today's processing power, such simulations may take from a couple of hours to several days. Unfortunately, real-time rendering appears not yet achievable using current single processor technology, although some efforts suggest that for relatively simple scenes this is indeed an achievable goal [42, 4]. Hence, multi-processor solutions may provide a means to make rendering more practical. In this dissertation, an avenue of research is followed in order to speed up the rendering of complex scenes, incorporating the sampling of diffuse inter-reflection.

1.1 Rendering

Arguably the best lighting simulation algorithms to date are ray tracing [77], radiosity [17] and two-pass algorithms which perform a radiosity pre-processing on the scene and use ray tracing to render the final image [61, 35]. The difference between these algorithms lies in which light paths are approximated and which are correctly simulated. This means that the lighting effects obtainable with ray tracing are slightly different from those of radiosity. This thesis focuses on ray tracing, which is arguably one of the most widely used accurate rendering algorithms, but incorporates sampling of diffuse inter-reflection, explained below.

As pointed out by Kajiya [29], all rendering algorithms aim to model the same lighting behaviour, i.e. light scattering off various types of surfaces, and hence try to solve the same equation, termed the rendering equation. Following the notation adopted by Shirley [61] (see appendix A), the rendering equation is given by:

$$L_o(x, \Theta_o) = L_e(x, \Theta_o) + \int_{\text{all } x'} v(x, x')\, f_r(x, \Theta'_o, \Theta_o)\, L_o(x', \Theta'_o)\, \frac{\cos\theta_i \cos\theta'_o}{\| x' - x \|^2}\, dA' \qquad (1.1)$$

This equation states that the outgoing radiance $L_o$ at surface point $x$ in direction $\Theta_o$ is equal to the emitted radiance $L_e$ plus the incoming radiance from all points $x'$ reflected into direction $\Theta_o$. In this equation, $v(x, x')$ is a visibility term, being 1 if $x'$ is visible from surface point $x$ and 0 otherwise. The material properties of surface point $x$ are represented in the bi-directional reflection distribution function (BRDF) $f_r(x, \Theta'_o, \Theta_o)$, which returns the amount of radiance reflected into direction $\Theta_o$ as a function of incident radiance from direction $\Theta'_o$. The cosine terms translate surface points in the scene into projected solid angles. Figure 1.1 shows an example of a surface with a point $x$ on it.


Figure 1.1: Rendering equation.

The particular surface material chosen gives rise to the BRDF depicted, and the geometry surrounding this surface, including lights and reflections off other surfaces, leads to a distribution of incoming radiances. In this example, the incoming radiance (as a function of direction) shows a number of discontinuities, but is relatively smooth in other areas. Because incoming radiance is smooth in some areas but behaves erratically in others, properly sampling this function is a difficult issue. Much research in graphics, for example using Monte Carlo techniques, is focussed on adequately sampling this function using a minimum number of samples.

The rendering equation is an approximation to Maxwell's equations for electro-magnetics [29] and therefore does not model all optical phenomena. For example, it does not include diffraction, and it also assumes that the medium in between surfaces does not scatter light. This means that participating media, such as smoke, clouds, mist and fire, are not accounted for.

There are two reasons for the complexity of physically correct rendering algorithms. One stems from the fact that the quantity to be computed, $L_o$, is part of the integral in equation 1.1, turning the rendering equation into a recursive integral equation. The other is that, although fixed, the integration domain can be arbitrarily complex (see for example figure 1.1). Recursive integral equations with fixed integration domains are called Fredholm equations of the second kind and have to be solved numerically [12].

1.2 Ray tracing

All currently popular rendering algorithms approximate the rendering equation. The differences lie in the type of error introduced by the different methods. One such approximation is called ray tracing [77]. The basic ray tracing algorithm follows, for each pixel of the image, one or more rays into the scene. If such a primary ray hits an object, the light intensity of that object is assigned to the corresponding pixel (figure 1.2). In order to model shadows, new rays are spawned from the intersection point of the ray and the object towards each of the light sources. These rays, called shadow rays, are used to compute visibility between the intersection point and the light sources.
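As an illustration of the structure just described, the following is a minimal sketch of such a recursive ray tracer. All type and helper names (Scene, Hit, directLight and so on) are placeholders invented for this example; they are not the data structures of Radiance or of the implementation described later in this thesis.

```cpp
// Minimal sketch of the recursive ray tracing loop described above. All
// geometry helpers are trivial stubs; only the control flow is of interest.
#include <optional>
#include <vector>

struct Vec3 { double x = 0, y = 0, z = 0; };
inline Vec3 operator+(Vec3 a, Vec3 b) { return {a.x + b.x, a.y + b.y, a.z + b.z}; }

struct Ray   { Vec3 origin, dir; };
struct Hit   { Vec3 point, normal; };
struct Light { Vec3 position, emission; };

struct Scene {
    std::vector<Light> lights;
    std::optional<Hit> intersect(const Ray&) const { return std::nullopt; }     // stub
    bool visible(const Vec3&, const Vec3&) const { return true; }               // stub
    Vec3 directLight(const Hit&, const Light& l) const { return l.emission; }   // stub
    Ray reflect(const Ray& r, const Hit& h) const { return {h.point, r.dir}; }  // stub
};

constexpr int kMaxDepth = 5;

// Trace one (primary or secondary) ray; recursion models mirroring reflection.
Vec3 trace(const Scene& scene, const Ray& ray, int depth = 0) {
    std::optional<Hit> hit = scene.intersect(ray);
    if (!hit) return {};                          // ray leaves the scene: background
    Vec3 colour;
    // Shadow rays: one towards each light source, to resolve visibility.
    for (const Light& light : scene.lights)
        if (scene.visible(hit->point, light.position))
            colour = colour + scene.directLight(*hit, light);
    // Reflection (and, analogously, transparency) rays are traced recursively.
    if (depth < kMaxDepth)
        colour = colour + trace(scene, scene.reflect(ray, *hit), depth + 1);
    return colour;
}
```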


Figure 1.2: Overview of ray tracing.

Mirroring reflection and transparency may be modelled similarly by shooting new rays into the reflected and/or transmitted directions (figure 1.2). These reflection and transparency rays are treated in exactly the same way as primary rays are. Hence, ray tracing is a recursive algorithm. In terms of the rendering equation, ray tracing can be defined more formally as [35]:

$$L_o(x, \Theta_o) = L_e(x, \Theta_o) + \sum_{\text{all } x_l \in L} v(x, x_l)\, f_{r,d}(x)\, L_e(x_l, \Theta'_o) \cos\theta_l\, d\omega_l + \int_{\Theta_s \in \Omega_s} f_{r,s}(x, \Theta_s, \Theta_o)\, L(x_s, \Theta_s) \cos\theta_s\, d\omega_s + \rho_d\, L_a(x) \qquad (1.2)$$

Here, the second term on the right-hand side computes the direct contribution of the light sources $L$. The visibility term is evaluated by casting shadow rays towards the light sources. The specular contribution is computed by evaluating the third term. If the specular component (the same holds for transparency) intersects a surface, this equation is evaluated recursively. As normally no diffuse inter-reflection is computed in ray tracing, the ambient component is approximated by a constant, the fourth term. An example of a surface point being evaluated using ray tracing is given in figure 1.3.

When diffuse inter-reflection is sampled, the fourth term in the above equation is replaced by an irradiance calculation, which is defined on a surface as the integral of radiance $L$ over the projected hemisphere [30]:

$$E = \int_0^{2\pi} \int_0^{\pi/2} L(\theta, \phi) \cos\theta \sin\theta \, d\theta \, d\phi \qquad (1.3)$$

Here, $E$ is the irradiance calculated for a point on a surface, $\theta$ is the polar angle from the surface normal and $\phi$ is the azimuthal angle from the surface normal.

Figure 1.3: Ray tracing example (light source sampling and sampling of (specular) reflection; everything not sampled is approximated with a constant term). The surface point in this example is not self emitting.

This irradiance computation can be approximated using a set of $n$ rays (called ambient rays) which are uniformly distributed over the projected hemisphere [73, 74]:

$$E \approx \frac{\pi}{2n^2} \sum_{j=1}^{n} \sum_{k=1}^{2n} L(\theta_j, \phi_k) \qquad (1.4)$$

where $\theta_j = \sin^{-1}\!\left(\sqrt{\frac{j - X}{n}}\right)$ and $\phi_k = \pi\,\frac{k - X}{n}$, $X$ is a uniform random variable between 0 and 1, and $2n^2$ is the total number of samples. Because this is a process of undirected shooting, light sources could be hit accidentally. If this happens, the result of that sampling is discarded, since light sources are sampled separately.
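The estimate of equation 1.4 translates almost directly into code. The sketch below is a generic version of this stratified hemisphere sampling; the callable L stands in for tracing one ambient ray and returning the sampled radiance (where accidental light source hits would be discarded, as described above), and a fresh random value X is drawn per sample.

```cpp
// Sketch of the irradiance estimate of equation 1.4: n * 2n stratified
// ambient rays over the projected hemisphere.
#include <cmath>
#include <random>

template <typename RadianceFn>
double estimateIrradiance(int n, RadianceFn L, unsigned seed = 42) {
    std::mt19937 rng(seed);
    std::uniform_real_distribution<double> X(0.0, 1.0);
    const double pi = 3.14159265358979323846;
    double sum = 0.0;
    for (int j = 1; j <= n; ++j) {
        for (int k = 1; k <= 2 * n; ++k) {
            // Stratified ambient ray directions (a fresh X per sample):
            //   theta_j = asin(sqrt((j - X) / n)),  phi_k = pi * (k - X) / n
            double theta = std::asin(std::sqrt((j - X(rng)) / n));
            double phi   = pi * (k - X(rng)) / n;
            sum += L(theta, phi);   // radiance returned by tracing one ambient ray
        }
    }
    return pi / (2.0 * n * n) * sum;   // E ~ (pi / 2n^2) * sum of L(theta_j, phi_k)
}
```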

Because computing $E$ is an expensive process and diffusely reflected energy generally tends to vary over a surface with low frequency, diffuse inter-reflection can sometimes be interpolated between previously computed values [74]. This is an optimisation which can potentially save an enormous amount of sampling. Before an irradiance calculation is performed, it is determined whether enough sampling points in the vicinity have been computed previously. If not, $E$ is computed and stored in a data structure. Otherwise, $E$ is approximated by interpolation. The data structure is an octree, which is separate from the spatial subdivision structure, and it is built on the fly when irradiance values become available.
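The following sketch shows this compute-or-interpolate decision in isolation. The names (IrradianceCache, sampleHemisphere and so on) are hypothetical, and the validity and weighting criteria of the actual octree-based cache are omitted; this is not the Radiance implementation.

```cpp
// Sketch of the compute-or-interpolate decision for diffuse inter-reflection.
#include <vector>

struct Vec3 { double x = 0, y = 0, z = 0; };
struct CachedSample { Vec3 position, normal; double irradiance; };

struct IrradianceCache {
    // Previously computed samples whose region of validity covers point p
    // with a compatible surface normal n (looked up in a separate octree).
    std::vector<CachedSample> nearbySamples(const Vec3& p, const Vec3& n) const;
    void store(const CachedSample& s);      // the octree is extended on the fly
};

double sampleHemisphere(const Vec3& p, const Vec3& n);  // full estimate, equation 1.4
double interpolate(const std::vector<CachedSample>& s); // weighted average of cached values

double irradianceAt(IrradianceCache& cache, const Vec3& p, const Vec3& n) {
    std::vector<CachedSample> nearby = cache.nearbySamples(p, n);
    if (!nearby.empty())                        // enough previously computed points:
        return interpolate(nearby);             // approximate E by interpolation
    double E = sampleHemisphere(p, n);          // otherwise compute E the expensive way
    cache.store({p, n, E});                     // and store it for later reuse
    return E;
}
```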

Tracing rays is a recursive process which has to be carried out for each individual pixel separately. A typical image of $1000^2$ pixels tends to cost at least a million primary rays and a multiple of that

in the form of shadow rays and reflection and transparency rays. The most expensive parts of the algorithm are the visibility calculations. For each ray, the object that intersects the ray first must be determined. To do this, a potentially large number of objects would have to be tested for intersection with each ray.

One of the first and arguably one of the most obvious optimisations is to spatially sort the objects as a pre-process, so that for each ray only a small subset of the objects needs to be tested, instead of intersecting all the objects in the scene. Sorting techniques of this kind are commonly known as spatial subdivision techniques [15].

Figure 1.4: Spatial subdivision techniques: (a) grid, (b) octree, (c) bintree.

The simplest of these is the grid (figure 1.4a), which subdivides the scene into a number of cells (or voxels) of equal size. Tracing a ray is now performed in two steps. First, the ray is intersected with a number of cells, which is called ray traversal. This process is cheap because grids are regular structures. In the second step, the objects in the cells that are actually traversed are intersected. Once an intersection in one cell is found, subsequent cells are not traversed anymore, and the objects in cells that are not traversed are not tested at all.

Although the grid is simple to implement and cheap to traverse, it does not adapt itself very well to the quirks of the particular model being rendered. Complex models usually concentrate a large number of objects in a few small areas, whereas the rest of the scene is virtually empty. Figure 1.2 is one such example of a complex scene in which a large concentration of objects is used to model the musical equipment and the couches. The floor and the walls, however, are made of just a few objects. Adaptive spatial subdivisions, such as the octree and the bintree (figures 1.4b and 1.4c give 2D examples), are better suited for complex scenes. Being tree structures, space is recursively subdivided into two (bintree) or eight (octree) cells whenever the number of objects in a cell is above a given threshold and the maximum tree depth has not yet been reached. The cells are smaller in areas of high object concentration, but the number of objects in each cell should be more or less the same. The cost of intersecting a ray with the objects in a cell is therefore (almost) the same for all cells in the tree.
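The two-step grid tracing described above is commonly implemented as an incremental cell walk (a 3D DDA in the style of Amanatides and Woo). The sketch below is a generic version under simplifying assumptions (unit-sized cells, ray origin inside the grid); the cell layout and the intersectObjectsInCell helper are illustrative, not the structures used later in this thesis, and a full implementation would also verify that a reported hit lies within the current cell.

```cpp
// Sketch of regular grid traversal: walk the ray from cell to cell and stop
// as soon as a traversed cell yields an intersection.
#include <cmath>

struct Vec3 { double x, y, z; };

// Placeholder for the second step: intersect the objects stored in cell
// (ix, iy, iz) with the ray; the actual object tests are omitted here.
bool intersectObjectsInCell(int ix, int iy, int iz, const Vec3& o, const Vec3& d,
                            double& tHit) {
    (void)ix; (void)iy; (void)iz; (void)o; (void)d; (void)tHit;
    return false;
}

// Assumes a grid of res^3 unit cells at the origin, with the ray origin inside it.
bool traceThroughGrid(int res, const Vec3& o, const Vec3& d, double& tHit) {
    int ix = (int)o.x, iy = (int)o.y, iz = (int)o.z;
    int stepX = d.x >= 0 ? 1 : -1, stepY = d.y >= 0 ? 1 : -1, stepZ = d.z >= 0 ? 1 : -1;
    // Parametric distance along the ray to the first cell boundary per axis.
    auto first = [](double p, double dir, int step) {
        double boundary = std::floor(p) + (step > 0 ? 1.0 : 0.0);
        return dir != 0 ? (boundary - p) / dir : 1e30;
    };
    double tMaxX = first(o.x, d.x, stepX), tMaxY = first(o.y, d.y, stepY),
           tMaxZ = first(o.z, d.z, stepZ);
    double tDeltaX = d.x != 0 ? std::fabs(1.0 / d.x) : 1e30;
    double tDeltaY = d.y != 0 ? std::fabs(1.0 / d.y) : 1e30;
    double tDeltaZ = d.z != 0 ? std::fabs(1.0 / d.z) : 1e30;

    while (ix >= 0 && ix < res && iy >= 0 && iy < res && iz >= 0 && iz < res) {
        if (intersectObjectsInCell(ix, iy, iz, o, d, tHit))
            return true;                       // hit found: later cells are skipped
        // Step into the neighbouring cell whose boundary is crossed first.
        if (tMaxX < tMaxY && tMaxX < tMaxZ) { ix += stepX; tMaxX += tDeltaX; }
        else if (tMaxY < tMaxZ)             { iy += stepY; tMaxY += tDeltaY; }
        else                                { iz += stepZ; tMaxZ += tDeltaZ; }
    }
    return false;                              // ray left the grid without a hit
}
```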

N

in the scene [53]. Given this

assumption, an upper bound for the cost (in seconds) of tracing a single ray through the scene is for the three spatial subdivision structures derived as follows [56]: Grid The number of grid cells is N , so that in each of the orthogonal directions x, y and z , the p number of cells will be 3 N . A ray travelling linearly through the structure will therefore cost

p

p

T = 3 N (Tcell + Tint ) = O( 3 N ) In this and the following equations Tcell is the time it takes to traverse a single cell and Tint is the time it takes on average to intersect a single object.

CHAPTER 1. INTRODUCTION

6

Bintree Considering a balanced bintree with N leaf cells, the height of the tree will be h, where 2h = N . The number of cells traversed by a single ray is then O(2 h3 ), because every three levels of the bintree constitute a subdivision in x, therefore

y and z directions.

p

The cost of traversal is

p

T = 2 h3 (Tcell + Tint ) = 3 N (Tcell + Tint ) = O( 3 N )

Octree In a balanced octree with N leaf cells, the height is h, where 8h p3 an octree intersects O ( 8h ) = O (2h ) cells:

p

= N . A ray traversing such p

T = 2h (Tcell + Tint ) = 3 N (Tcell + Tint ) = O( 3 N ) Although the asymptotic behaviour of these three spatial subdivision techniques are the same, in practice differences may occur between the grid and the tree structures due to the grid’s inability to adapt to the distribution of data in the scene. Also, in practice many rays will not reach this upper bound because an intersection may occur after a smaller number of cell-traversals. Spatial subdivision techniques have reduced the number of intersection tests for each ray dramatp ically from O (N ) to O ( 3 N ), but a very large number of intersection tests is still required due to the sheer number of rays being traced and due to the complexity of the scenes which has only increased over the years. Other sorting mechanisms that improve the speed of rendering, such as bounding box strategies, exist, but differ only in the fact that objects are now bounded by simple shapes that need not be in a regular structure. This means that bounding spheres or bounding boxes may overlap and may be of arbitrary size. The optimisation is due to the fact that intersecting a ray with such a simple shape is often much cheaper than intersecting with the more complex geometry it encapsulates. Bounding spheres or boxes may be ordered in a hierarchy as well, leading to a tree structure that removes the need to test all the bounding shapes for each ray. Because bounding boxes (and spheres) are quite similar to the spatial subdivision techniques discussed above, their improved adaptability to the scene and their possibly more expensive ray traversal cost being the differences, these techniques are not considered in this thesis any further. The reduction in intersection tests is of the same order as for spatial subdivision techniques. As other optimisations that significantly reduce the time complexity of ray tracing are not imminent, the most viable route to improve execution times is to employ parallel processing.
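To make the gain concrete with a purely illustrative round number (not a measurement from this thesis): for a scene of $N = 10^6$ objects,

$$\sqrt[3]{N} = \sqrt[3]{10^6} = 100,$$

so a ray traverses on the order of a hundred cells, each holding only a handful of objects, rather than being tested against all $10^6$ objects directly.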

1.3 Parallel ray tracing Parallel processing offers the potential in reducing computation time by employing more than one processor for solving the problem. A number of issues play an important role that are not present in sequential programming. Apart from correctness and robustness, efficiency and performance are of

CHAPTER 1. INTRODUCTION

7

utmost importance. A choice must be made whether to decompose the algorithm into a number of preferably independent tasks or to decompose the problem domain where each processor will execute the same (sequential) program on a subset of the data. Task management, i.e. the decision mechanism that resolves the question on which processors to execute which tasks, and data management, which decides which data to store with which processor, are closely related issues that cannot be solved separately. Both influence each other and have an impact on issues of load balancing, data and task communication as well as idle time. The most obvious parallel implementation of ray tracing would simply replicate all the data with each processor and subdivide the screen into a number of disjunct regions. Each processor then renders a number of regions using the unaltered sequential version of the ray tracing algorithm, until the whole image is completed. Whenever a processor finishes a region, it asks the master processor for a new task. In terms of parallel processing, this is called the demand driven approach. In computer graphics terms this would be called an image space subdivision. Because communication is only required to distribute tasks and collate results, idle time should be minimal and the speed-up to be expected with this type of parallelism is near linear. Because the algorithm itself is sequential as well, this algorithm falls in the class of embarrassingly parallel algorithms. Unfortunately, the above parallel implementation assumes that the local memory of each processor is large enough to hold the entire scene. If this is the case, then this is also the best possible way to parallelise a ray tracing algorithm. If very large models need to be rendered, or if the complexity of the lighting model increases, the storage requirements will increase in accordance. It may then become impossible to run this embarrassingly parallel algorithm efficiently. Chapter 2 explains the image space subdivision method in greater detail and also details previous attempts to overcome the associated problems, and their main strengths and weaknesses. This thesis is meant as a study of ray tracing algorithms, aiming to discover non-apparent parallelism and employing these insights to obtain an efficient parallel implementation of a ray tracing algorithm. This is in contrast with the usual aims and objectives in parallel processing, which are to develop generally applicable techniques and apply those to a certain application, where ray tracing may be one such application. The idea is that a specialised study of the particular algorithm at hand may lead to a better understanding of what parallelism is available and how best to exploit it. The remaining popularity of ray tracing in general warrants such an approach. Rather than starting a parallel implementation from scratch, all the techniques and methods presented in this thesis are implemented in a well-known ray tracing package called Radiance [72], allowing to build on the expertise of at least ten years of previous research in computer graphics. For the same reason, the resulting parallel implementation should appeal to the users that are already familiar with this package.

CHAPTER 1. INTRODUCTION

8

1.4 General remarks

Regarding the experiments carried out in various chapters of this dissertation, the following remarks hold:

- A number of different hardware platforms were available for development and experimentation. These are described in appendix C. Unfortunately, not all architectures were available all the time. For this reason, different experiments were executed on different architectures, thereby prohibiting direct comparison between certain tests. However, it is felt that for none of the experiments has this resulted in an inability to draw the conclusions that the experiment set out to demonstrate.

- All rendering times are given in seconds, unless otherwise stated. Similarly, all memory consumption is indicated in kilobytes (kB), unless indicated otherwise.

- Various different scenes were used for different experiments. Images are included in the main text at the point where they are used for the first time. Appendix D provides an overview of the scenes' general characteristics.

- All results reported in this dissertation were achieved using PVM for handling communication [14]. All hardware platforms run a version of PVM; in the case of the Parsytec PowerXplorer, it was a homogeneous version, called Power-PVM.

Finally, the chapters in this thesis are meant to be fairly independent of each other, and it should be possible to read them in non-consecutive order. Every chapter has detailed conclusions at the end, while the final chapter (chapter 9) presents the overall conclusions in broad terms only. The work is segmented into an overview of the field of parallel ray tracing (chapter 2), followed by an analysis of data accesses as they occur within a sequential ray tracer (chapter 3). An elaborate description of the main body of work carried out for this dissertation is given in chapter 4. As this work consists of various parts, each of these is described and assessed in a separate chapter. Included are chapters on data selection, necessary to match tasks and data (chapter 5), data distributions (chapter 6), demand driven scheduling (chapter 7) and data parallel scheduling (chapter 8). This dissertation also contains four appendices which provide background information on notation and terminology, internal data structures, hardware environments and statistics regarding the test scenes used.

2 Previous work

The object of parallel processing is to find a number of preferably independent tasks and execute these tasks on different processors. Because in ray tracing the computation of one pixel is completely independent of any other pixel, this algorithm lends itself very well to parallel processing. As the data used during the computation is read but not modified, the data could easily be duplicated across the available processors. This would then lead to the simplest possible parallel implementation of a ray tracing algorithm. The only issue left to be addressed is that of load balancing. Superficially, ray tracing does not seem to present any great difficulties for parallel processing.

However, in massively parallel applications, duplicating data across processors is very wasteful and limits the problem size to that of the memory available with each processor. When the scene does not fit into a single processor's memory, the problem of parallelising ray tracing suddenly becomes a lot more interesting, and the following sections address the issues involved. Three different types of scheduling have been tried on ray tracing: the demand driven, the data parallel and the hybrid scheduling approach [50, 7]. They are discussed in sections 2.1 through 2.3.

2.1 Demand driven ray tracing

The most obvious parallel implementation of ray tracing would simply replicate all the data with each processor and subdivide the screen into a number of disjunct regions [41, 18, 45, 19, 20, 6, 38, 67, 25], or adaptively subdivide the screen and workload [39, 40]. Each processor then renders a number of regions using the unaltered sequential version of the ray tracing algorithm, until the whole image is completed. Tasks can be distributed before the computation begins [25]; this is sometimes referred to as a data driven approach. Communication is then minimal, as only completed sub-images need to be transferred to file. However, load imbalances may occur due to the differing complexities associated with different areas of the image (see figure 2.1).

To actively balance the workload, tasks may be distributed at run-time by a master processor. Whenever a processor finishes a sub-image, it asks the master processor for a new task (figure 2.2). In terms of parallel processing, this is called the demand driven approach; in computer graphics terms it would usually be called a screen space subdivision. The speed-ups to be expected with this type of parallelism are near linear, as the overhead introduced is minimal. Because the calculations of all pixels are independent, this algorithm falls in the class of embarrassingly parallel algorithms. Communication is generally not a major problem with this type of parallel ray tracing.
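The sketch below shows the shape of this demand driven master/worker protocol. The message-passing helpers are declared only and are generic placeholders; the implementations discussed in this thesis use PVM, whose API is not reproduced here, and collation of the returned pixel data is omitted.

```cpp
// Sketch of demand driven task distribution: a master hands out screen
// regions to whichever processor has just become idle.
#include <queue>

struct Region { int x0, y0, x1, y1; };                  // one rectangular sub-image

int  recvTaskRequest();                                  // blocks; returns idle worker id
void sendTask(int worker, const Region& r);              // hand a region to a worker
void sendStop(int worker);                               // tell a worker it can terminate
bool recvTask(Region& r);                                // worker side: false means "stop"
void requestTask();                                      // worker side: ask for work
void renderAndReturnRegion(const Region& r);             // sequential tracer + send pixels

void master(std::queue<Region> tasks, int numWorkers) {
    int running = numWorkers;
    while (running > 0) {
        int worker = recvTaskRequest();                  // a processor has become idle
        if (!tasks.empty()) { sendTask(worker, tasks.front()); tasks.pop(); }
        else                { sendStop(worker); --running; }
    }
}

void worker() {
    Region r;
    requestTask();
    while (recvTask(r)) {           // false means the master sent "stop"
        renderAndReturnRegion(r);   // unaltered sequential ray tracer, then send pixels
        requestTask();
    }
}
```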


Figure 2.1: Different areas of the image have different complexities.

After finishing a task, a processor may request a new task from a master processor. This involves sending a message to the master, which in turn will send a message back. The other communication that will occur is that of writing the partial images to either the frame buffer or to a mass storage device. Load balancing is achieved dynamically by only sending new tasks to processors that have just become idle. The biggest problems occur right at the beginning of the computation, where each processor is waiting for a task, and at the end of the computation, when some processors are finishing their tasks while others have already finished. One way of minimising load imbalances would be task stealing, where tasks are migrated from overloaded processors to ones that have just become idle [3].

In order to facilitate load balancing, it would be advantageous if each task took approximately the same number of computer cycles. In a screen space subdivision based ray tracer, the complexity of a task depends strongly on the number of objects that are visible in its region (figure 2.1). Various methods exist to balance the workload. The left image in figure 2.3 shows a single task per processor approach. This is likely to suffer from load imbalances, as clearly the complexity of each of the tasks is different. The middle image shows a good practical solution: having multiple smaller regions per processor. This is likely to give smaller, but still significant, load imbalances at the end of the computation. Finally, the right image in figure 2.3 shows how each region may be adapted in size to create a roughly similar workload for each of the regions. Profiling by subsampling the image to determine the relative workloads of different areas of the image would be necessary (and may also be used to create a suitable spatial subdivision, should the scene be distributed over the processors [47]).

Unfortunately, parallel implementations based on image space subdivisions normally assume that the local memory of each processor is large enough to hold the entire scene. If this is the case, then this is also the best possible way to parallelise a ray tracing algorithm. Shared memory (or virtual shared memory) architectures would best adopt this strategy too, because good speed-ups can be obtained using highly optimised ray tracers [36, 62, 31].


Figure 2.2: Demand driven ray tracing. Each processor requests a task from the master processor. When the master receives a request, it sends a task to the requesting processor. After this processor finishes its task, it sends the resulting pixel data to the master for collation and requests a new task.

It has the additional advantage that the code hardly needs any rewriting to go from a sequential to a parallel implementation. However, so far shared memory machines tend to suffer from a memory access bottleneck if too many processors require access to memory at the same time. Hardware scalability is therefore an issue.

If very large models need to be rendered on distributed memory machines or on clusters of workstations, or if the complexity of the lighting model increases, the storage requirements will increase accordingly. It may then become impossible to run this embarrassingly parallel algorithm efficiently, and other strategies will have to be found. An important consequence is that the scene data will have to be distributed. Data access will then incur different costs depending on whether the data is stored locally or with a remote processor. It suddenly becomes very important to store frequently accessed data locally, while less frequently used data may be kept at remote processors. If the above screen space subdivision is to be maintained, caching techniques may be helpful to reduce the number of remote data accesses. The unpredictable nature of the data access patterns that ray tracing exhibits makes cache design a non-trivial task [18, 20]. However, for certain classes of rays, cache design can be made a little easier by exploiting coherence (also called data locality). Different kinds of coherence are distinguished in parallel rendering, the most important of which are:

Object coherence: Objects consist of separate connected pieces bounded in space, and distinct objects are disjoint in space. This is the main form of coherence; the others are derived from object coherence [65]. Spatial subdivision techniques, such as grids, octrees and bintrees, directly exploit this form of coherence, which explains their success.

Image coherence: When a coherent model is projected onto a screen, the resulting image should exhibit local constancy as well. This was effectively exploited in [76].


Figure 2.3: Image space subdivision for four processors. (a) One subregion per processor. (b) Multiple regions per processor. (c) Multiple regions per processor, but each region should bring about approximately the same workload.

Ray coherence: Rays that start at the same point and travel into similar directions are likely to intersect the same objects. An example of ray coherence is given in figure 2.4, where most of the plants do not intersect the viewing frustum. Only a small percentage of the plants in this scene is needed to intersect all of the primary rays drawn into it.


Figure 2.4: Ray coherence: the rays depicted intersect only a small number of objects.

Unfortunately, ray tracing in itself does not show the required amount of data coherence/locality. One way of regaining data locality is by computation re-ordering, for example as exploited by breadth-first ray tracing techniques [43, 60, 59]. For ray tracing, ray coherence is also easily exploited for bundles of primary rays and bundles of shadow rays (assuming that area light sources are used). It is possible to select the data necessary for all of these rays by intersecting a bounding pyramid with a spatial subdivision structure [46, 78]. The resulting list of voxels can then be communicated to the processor requesting the data. This idea is more fully explained in chapter 4.


Figure 2.5: Tracing and shading in a data parallel configuration.

2.2 Data parallel ray tracing

A different approach to rendering scenes that do not fit into a single processor's memory is called data parallel rendering. In this case, the data is distributed amongst the processors. Each processor owns a subset of the scene database and traces rays only when they pass through its own subspace [8, 33, 34, 5, 18, 47, 26, 63, 44, 68, 69, 32, 54]. If a processor detects an intersection in its own subspace, it will spawn secondary rays as usual. Shading is normally performed by the processor that spawned the ray. In the example in figure 2.5, all primary rays are spawned by processor 6. The primary ray drawn in this image intersects a chair, which is detected by processor 2, and a secondary reflection ray is spawned, as well as a number of shadow rays. These rays are terminated by processors 1, 3 and 5 respectively. The shading results of these processors are returned to processor 2, which will assemble the results and shade the primary ray. This shading result is subsequently sent back to processor 6, which will eventually write the pixel to screen or file.

In order to exploit coherence between data accesses as much as possible, usually some spatial subdivision is used to decide which parts of the scene are stored with which processor. In its simplest form, the data is distributed according to a uniform distribution (see figure 2.6a). Each processor will hold one or more equally sized voxels [8, 44, 68, 69, 54]. Having just one voxel per processor allows the data decomposition to be nicely mapped onto a 2D or 3D grid topology. However, since the number of objects may vary dramatically from voxel to voxel, the cost of tracing a ray through each of these voxels will vary, and therefore this approach may lead to severe load imbalances. A second, and more difficult, problem to address is the fact that the number of rays passing through each voxel is likely to vary. Certain parts of the scene attract more rays than other parts. This has mainly to do with the view point and the location of the light sources. Both the variation in cost per ray and the variation in the number of rays passing through each voxel indicate that having multiple voxels per processor is a good option, as it is likely to reduce the impact of load imbalances.
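The per-processor handling of an incoming ray in such a data parallel configuration can be sketched as follows. All names are illustrative placeholders and do not correspond to a particular system cited in this section; the bookkeeping needed to assemble the contributions of spawned rays is only hinted at.

```cpp
// Sketch of how one processor handles a ray in a data parallel ray tracer.
#include <optional>

struct Ray { int pixel; int owner; /* origin, direction, ...        */ };
struct Hit { int object;           /* intersection point, normal, ... */ };

std::optional<Hit> intersectLocalSubspace(const Ray& r);      // locally stored voxels only
std::optional<int> nextProcessorAlongRay(const Ray& r);       // owner of the next subspace
void forwardRay(int processor, const Ray& r);                  // migrate the ray
void spawnSecondaryAndShadowRays(const Ray& r, const Hit& h);  // these may migrate as well
void shadeWhenChildrenReturn(const Ray& r, const Hit& h);      // result later sent to r.owner
void returnBackground(const Ray& r);                           // ray left the scene entirely

void handleIncomingRay(const Ray& r) {
    if (std::optional<Hit> hit = intersectLocalSubspace(r)) {
        spawnSecondaryAndShadowRays(r, *hit);    // shadow and reflection rays
        shadeWhenChildrenReturn(r, *hit);        // assemble results, then reply to r.owner
    } else if (std::optional<int> next = nextProcessorAlongRay(r)) {
        forwardRay(*next, r);                    // ray continues in a remote subspace
    } else {
        returnBackground(r);                     // no further subspace to traverse
    }
}
```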


Figure 2.6: Example data distributions for data parallel ray tracing.

Another approach is to use a hierarchical spatial subdivision, such as an octree [33, 34, 18, 49], a bintree (see figures 2.6b and 2.6c) or hierarchical grids [63], and subdivide the scene according to some cost criterion. Three cost criteria are discussed by Salmon and Goldsmith [57]:

- The data should be distributed over the processors such that the computational load generated at each processor is roughly the same.
- The memory requirements should be similar for all processors as well.
- The communication cost incurred by the chosen data distribution should be minimised.

Unfortunately, in practice it is very difficult to meet all three criteria. Therefore, usually a simple criterion is used, such as splitting off subtrees such that the number of objects in each subtree is roughly the same. This way, at least the cost of tracing a single ray will be the same for all processors. Also, storage requirements are evenly spread across all processors. A method for estimating the cost per ray on a per voxel basis is presented in [56]. Memory permitting, a certain degree of data duplication may be very helpful as a means of reducing load imbalances. For example, data residing near light sources may be duplicated with some or all processors, or data from neighbouring processors may be stored locally [63, 54].

In order to address the second problem, such that each processor will handle roughly the same number of ray tasks, profiling may be used to achieve static load balancing [47, 26]. This method attempts to equalise both the cost per ray and the number of rays over all processors. It is expected to outperform other static load balancing techniques at the cost of an extra pre-processing step.

If such a pre-processing step is to be avoided, the load in a data parallel system could also be balanced dynamically. This involves dynamic redistribution of data [11]. The idea is to move data from heavily loaded processors to their neighbours, provided that these have a lighter workload. This could be accomplished by shifting the voxel boundaries. Alternatively, the objects may be randomly distributed over the processors (and thus not according to some spatial subdivision) [32]. A ray will then have to be passed from processor to processor until it has visited all the processors. If the network topology is ring based, communication can be pipelined and remains local. Load balancing can be achieved by simply moving some objects along the pipeline from a heavily loaded processor to a less busy processor.

In general, the problem with data redistribution is that data accesses are both highly irregular and unknown, both in space and in time. Tuning such a system is therefore very difficult. If data is redistributed too often, the data communication between processors becomes the dominant factor. If data is not redistributed often enough, a suboptimal load balance is achieved. In summary, data parallel ray tracing systems allow large scenes to be distributed over the processors' local memories, but tend to suffer from load imbalances; a problem which is difficult to solve with either static or dynamic load balancing schemes. Efficiency thus tends to be low in such systems.

2.3 Hybrid scheduling

The challenge in parallel ray tracing is to find algorithms which allow large scenes to be distributed without losing too much efficiency due to load imbalances (data parallel rendering) or communication (demand driven ray tracing). Combining data parallel and demand driven aspects into a single algorithm may lead to implementations with a reasonably small amount of communication and an acceptable load balance. Hybrid scheduling algorithms have both demand driven and data parallel components running on the same set of processors, each processor being capable of handling both types of task [58, 28, 27, 48, 54]. The data parallel part of the algorithm then creates a basic, albeit uneven, load.


Tasks that are not computationally very intensive but require access to a large amount of data are ideally suited for data parallel execution. On the other hand, tasks that require a relatively small amount of data could be handled as demand driven tasks. By assigning demand driven tasks to processors that attract only a few data parallel tasks, the uneven basic load can be balanced. Because it is assumed that these demand driven tasks do not access much data, the communication involved in the assignment of such tasks is kept under control.

An object subdivision similar to Green and Paddon's [18] is presented by Scherson and Caspary [58]: the algorithm has a preprocessing stage in which a hierarchical data structure is built. The objects and their bounding boxes are distributed over the processors, whereas the hierarchical data structure is replicated with all processors. During the rendering phase, two types of task are discerned: demand driven ray traversal and data parallel ray-object intersection. Demand driven processes, which compute the intersection of rays with the hierarchical data structure (which is duplicated with each processor), can be executed on any processor. Data driven processes, which intersect rays with objects, can only be executed by the processor holding the specified object.

A similar data parallel plus demand driven approach is presented by Jevans [28]. Again each processor runs two processes: an intersection process operating in demand driven mode and a ray generator process working in data driven mode. The environment is subdivided into sub-spaces (voxels) and all objects within a voxel are stored with the same processor. However, the voxels are distributed over the processors in random fashion. Also, each processor holds the entire sub-division structure. The ray generator that runs on each processor is assigned a number of screen pixels. For each pixel, rays are generated and intersected with the spatial sub-division structure. For every voxel that a ray intersects, a message is dispatched to the processor holding the object data of that voxel. The intersection process receives these messages, which contain the ray data, and intersects the rays with the objects it holds locally. It also performs shading calculations. After a successful intersection, a message is sent back to the ray generator. The algorithm is optimistic in the sense that the generator process assumes that the intersection process for a voxel will conclude that no object is intersected. Therefore, the generator process does not wait for the intersection process to finish, but keeps intersecting the ray with the sub-division structure. Many messages may therefore be sent in vain. To be able to identify and destroy the unwanted intersection requests, all messages carry a time stamp.

The ability of demand driven tasks to effectively balance the load depends strongly on the amount of work involved with each task. If the task is too light, then the load may remain unbalanced. As the cost of ray traversal is generally deemed cheap compared with ray-object intersection, the effectiveness of the above split of the algorithm into data parallel and demand driven tasks needs to be questioned.

Another hybrid algorithm was proposed by Jansen and Chalmers [27] and Reinhard and Jansen [54]. Rays are classified according to the amount of coherence that they exhibit. If much coherence is present, for example in bundles of primary or shadow rays, these bundles are traced in demand driven mode, one bundle per task. Because the number of rays in each bundle can be controlled, task granularity can be increased or decreased when necessary.


Normally, it is advantageous to have as many rays in as narrow a bundle as possible. In this case the workload associated with the bundle of rays is high, while the number of objects intersected by the bundle is limited. Task and data communication associated with such a bundle is therefore limited as well. The main data distribution can be according to a grid or an octree, where the spatial subdivision structure is replicated over the processors. The spatial subdivision either holds the objects themselves in its voxels, or identification tags indicating which remote processor stores the data for those voxels. If a processor needs access to a part of the spatial subdivision that is not locally available, it reads the identification tag and, in the case of data parallel tasks, migrates the task at hand to that processor or, in the case of demand driven tasks, sends a request for data to that processor.

This hybrid algorithm can also be interpreted as a data parallel algorithm where the “hot-spots” near the viewing point and light sources are off-loaded to processors that are less busy with the basic data parallel load, effectively moving work from busy processors to less busy processors. The data parallel component remains essential for those tasks that require large amounts of data (e.g. texture maps).

Hybrid scheduling is the method explored in detail in this thesis. The following chapters first present a much more detailed overview of the coherence based hybrid scheduling algorithm (chapter 4) and then discuss the basic building blocks needed for this algorithm, which include a method to pre-select the data required for demand driven tasks (chapter 5) and static data distributions (chapter 6). Then various scheduling methods are explored in chapter 7, as well as the use of data caches and memory management. The data parallel component is presented in more detail in chapter 8. All of these chapters contain experiments assessing the strengths and weaknesses of the subsystems involved, and conclusions are drawn accordingly. The final chapter contains overall conclusions and hints as to where future research may be directed.

3 Scene analysis

For ray tracing it is a priori unknown which scene data is going to be accessed, nor when and how often. This has profound implications for any strategy that aims to render distributed data sets, including hybrid scheduling. In the following sections, the extent of this problem is assessed by gathering statistics from a sequential ray tracer [52]. This should provide insight into which algorithms may be able to cope with highly complex scenes and irregular data access patterns, and which algorithms are likely to perform less adequately.

3.1 Distribution of data accesses

Given an understanding of the algorithms involved in ray tracing, an intuitive idea may be obtained of the spatial distribution of object accesses. For example, it is to be expected that light sources, and those objects directly in front of light sources, will be queried for intersections much more often than objects located in remote or dark corners. By a similar argument, the objects that lie within the viewing frustum can be expected to be intersected more often than those outside the viewing frustum. For indirect reflection, object accesses are expected to be less coherent, as this involves sampling a hemisphere rather than a narrow bundle of rays. However, it remains difficult to predict how irregular data accesses are over time and how much more often the most used objects are required compared with the least used objects. Nonetheless, these are important notions that may have a direct impact on the efficiency of parallel algorithms. For example, for caching of object data to be useful, it would be advantageous to have a relatively small subset of the scene geometry that is accessed a very large number of times, while the large majority of objects would be required only a couple of times.

In order to verify whether there is a small selection of objects that are intersected significantly more often than the other objects, a number of images were rendered while counting the number of intersection tests per object. The test scenes are the colour cube model in both diffuse and transparent versions and the studio model (figure 3.1). These scenes differ in the distribution of their objects over space. The objects in the colour cube model are quite evenly spread over space, leading to a high level of occlusion. The studio model has many objects located in a small part of the scene, with much less occlusion. Images were rendered with and without diffuse inter-reflection.
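The counting itself amounts to little more than incrementing a per-object counter inside the intersection test and sorting the counters afterwards. The sketch below shows one way this instrumentation could look; the names are hypothetical and do not correspond to the actual Radiance code.

#include <stdlib.h>

typedef struct { long id; long tests; } AccessCount;

static AccessCount *counts;      /* one entry per scene object */
static long         nobjects;

static void init_counts(long n)
{
    nobjects = n;
    counts = calloc((size_t)n, sizeof(AccessCount));
    for (long i = 0; i < n; i++)
        counts[i].id = i;
}

/* Called from inside the ray-object intersection test. */
static void count_intersection(long object_id)
{
    counts[object_id].tests++;
}

static int by_tests_desc(const void *a, const void *b)
{
    long ta = ((const AccessCount *)a)->tests;
    long tb = ((const AccessCount *)b)->tests;
    return (ta < tb) - (ta > tb);    /* most frequently tested objects first */
}

/* Sort to obtain curves like those shown in figure 3.2. */
static void sort_counts(void)
{
    qsort(counts, (size_t)nobjects, sizeof(AccessCount), by_tests_desc);
}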


Figure 3.1: Test scenes used for this chapter's experiments.

The results of these renderings can be viewed in figure 3.2. In these graphs the objects are sorted according to frequency of access, with the most often accessed objects on the left. The more uneven the distribution of data accesses over the objects, the steeper the average slope of the graph. It appears that there is indeed a large difference between the most often intersected object and a selection of objects that is not queried for intersections at all (eight orders of magnitude, in fact). This is according to expectation. However, it should be noted that for the colour cube model more than half of the objects form a middle category: these are still intersected more than 10^4 times (no indirect sampling) and 10^5 times (with indirect sampling). The impact of indirect sampling on the distribution of object intersections is fairly small. Object accesses are distributed slightly more evenly, which in figure 3.2 shows up as a graph that tails off less than the graph for rendering without indirect sampling.

The following experiments are repetitions of the above ones, but now for different ray types.

Figure 3.2: Data accesses sorted by frequency. Top left: colour cube model. Top right: transparent colour cube. Bottom left: studio model. Bottom right: conference room.

Splitting the measurements by ray type should reveal whether object accesses are more unevenly distributed for certain types of ray. For renderings without diffuse inter-reflection or specular reflection, effectively shooting only primary rays and shadow rays, the slopes for shadow and primary rays are similar, indicating that the distribution of data accesses over the objects is similar (figure 3.3, left). For a non-specular scene with sampling of diffuse inter-reflection, the slope for rays that sample indirect reflection is less steep than for shadow rays and primary rays (figure 3.3, right). This indicates a more even distribution of object intersections caused by indirect reflection rays. This is a direct result of the lack of coherence between ambient rays and again confirms expectations.

When a significant amount of transparency and reflection is added to the same scene, the results do not alter drastically (figure 3.4). The object intersections for primary rays are completely unchanged, which is due to the unchanged geometry and view point. The total number of rays is reduced, because diffuse inter-reflection is not sampled for glass objects. However, the pattern of object accesses seems to be largely unaffected. The reflection and refraction rays seem to access most of the objects in a similar manner to shadow and ambient rays.

For a more realistic scene, such as the studio model, the results are somewhat different (figure 3.5). Here, the amount of reflection and refraction in the scene is relatively small, but a fair amount of specularity is present due to some of the materials used¹. The fact that significantly fewer objects are accessed for specular reflection is probably due to the orientation of most objects in the scene, and hence can be attributed to a particularity of the chosen scene. However, for shadow, diffuse inter-reflection and primary rays, the sampling pattern is similar to the other scenes.

¹ In Radiance, a distinction is made between reflection and specularity. Reflection is sampled with a single ray, whereas specular reflection is meant to sample glossy (Gaussian) reflection with a number of rays scattered around the direction of reflection. The same distinction exists between refraction and transparency.

Figure 3.3: Data accesses for the colour cube model split according to ray type. The objects are sorted by number of accesses. Renderings were made without diffuse sampling (left) and with diffuse sampling (right).

Figure 3.4: Data accesses for the transparent colour cube model split according to ray type. Results of a rendering without diffuse sampling are on the left and with diffuse sampling on the right.

3.2 Temporal characteristics

The previous section has shown that most objects will be accessed during rendering, but it does not show how these accesses are distributed over time. Temporal behaviour is an important issue, as together with the issues discussed in the previous section it determines the suitability of caching schemes. The more concentrated over time data accesses are, the more successful caching algorithms can be. This section assesses the temporal behaviour of object accesses.

The same test scenes as in the previous section were used. For each of the renderings in the previous section, the objects were sorted according to frequency of access. Here, the ten median objects are chosen, which belong to neither the group of most accessed objects nor the group of least accessed objects. The number of ray-object intersections per second was recorded, and a selection of graphs is shown in figures 3.7 to 3.11. These graphs show that the time between the first and the last data access for a single object is always at least 20% of the total rendering time, and often as much as 60%, even without diffuse inter-reflection. As the ten median objects were chosen for these tests, and these can be thought of as representative of most objects in the scene, we deduce that this result extends to the majority of objects in the scene. Adding diffuse inter-reflection tends to destroy temporal coherence for both the colour cube and transparent colour cube models. The studio model shows different behaviour. Here, the number of data accesses for the median objects is very small and temporally coherent (figure 3.11). Adding diffuse inter-reflection does not significantly alter this behaviour (figure 3.12). The reason appears to be that for this model the scene is densely populated with objects in a relatively small area, while the rest of the room is empty, except for the walls and the light sources. As the majority of these small objects do not occlude one another, a relatively small number of objects is accessed during rendering.
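Gathering these statistics only requires a per-object histogram of accesses over wall-clock seconds, along the lines of the following sketch. All names are illustrative and the fixed-size histogram is an assumption made for brevity; it is not the actual instrumentation code.

#include <time.h>

#define MAX_SECONDS 20000

static long   histogram[MAX_SECONDS];   /* accesses per second for one object */
static time_t start_time;

static void start_profiling(void)
{
    start_time = time(NULL);
}

/* Called whenever the monitored object is queried for an intersection. */
static void record_access(void)
{
    long second = (long)(time(NULL) - start_time);
    if (second >= 0 && second < MAX_SECONDS)
        histogram[second]++;    /* one more intersection test in this interval */
}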

Figure 3.5: Data accesses for the studio model split according to ray type. Left are the results of a rendering without diffuse sampling and with diffuse sampling is on the right.

Figure 3.6: Data accesses for the conference room split according to ray type. Left are the results of a rendering without diffuse sampling and with diffuse sampling is on the right.

Figure 3.7: Data accesses per second for two objects of the colour cube model. Of the ten median objects chosen, the top graph shows the most widespread and the bottom graph shows the most concentrated set of accesses. No diffuse inter-reflection was calculated. The total number of object accesses was 58555 for the top graph and 58737 for the bottom graph.

Figure 3.8: This graph shows the temporal behaviour for the colour cube model including diffuse inter-reflection. Otherwise, similar to figure 3.7.

When adding diffuse inter-reflection, the walls, floor and ceiling are most likely to be hit by diffuse inter-reflection rays. All other objects may receive an occasional diffuse inter-reflection ray. For the conference room, the median objects did not record any intersections: less than half the scene was used for the rendering shown in figure 3.1. This unexpected behaviour is possibly due to the fact that most of these objects are instanced (most notably the chairs), leading to a large amount of visible detail caused by a small number of objects. The geometry which is situated near the walls is not within the viewing frustum and therefore does not tend to be intersected either. Adding diffuse inter-reflection to this scene increased the number of accessed objects substantially, although the number of intersections for a large number of objects is still relatively small.

In summary, the amount of temporal coherence depends on geometry and the chosen view point. In scenes where the geometry is evenly distributed over space, coherence is least, whereas an uneven spread of objects over the scene, such as seen in the studio and conference room models, gives rise to much more coherent accesses. Adding diffuse inter-reflection is always detrimental to preserving temporal coherence. For this reason, the next section explores to what extent different ray types (such as primary rays, shadow rays and the like) exhibit coherent object access patterns.

Figure 3.9: Data accesses per second for a characteristic object of the transparent colour cube model.

Figure 3.10: This graph shows the temporal behaviour for a characteristic object of the transparent colour cube model including diffuse inter-reflection. Compare with figure 3.9 to see the impact of adding diffuse inter-reflection.

3.3 Temporal behaviour per ray type

Although the ray-object intersections per unit of time are spread out, this is not necessarily true for all ray types. Looking, for example, at primary rays only, the time between first and last access is much shorter (see figure 3.13, where the top graph of figure 3.7 is split into primary and shadow rays). Much of the spread in this rendering is due to shadow rays. This can be attributed to the fact that shadow rays originate from many different intersection points which are located in different places in the scene.

By sampling diffuse inter-reflection, temporal coherence is lost (as argued in the previous section). Splitting object accesses per ray type reveals that shadow rays, rather than diffuse rays, account for most of the loss of coherence (figure 3.14). This is attributed to the fact that for every diffuse inter-reflection ray, a number of shadow rays is shot towards the light sources. The diffuse sampling causes the origins of the shadow rays to be spread over the scene.

The transparent colour cube model shows behaviour similar to the colour cube model. As the previous section has already shown that temporal coherence for the studio model is fairly well preserved, this will also be true for the individual ray types. The same argument holds for the conference room scene.

Figure 3.11: Data accesses per second for a characteristic object of the studio model.

Figure 3.12: These graphs show the temporal behaviour for the studio model including diffuse inter-reflection. Otherwise, similar to figure 3.11.

3.4 Conclusions

The tests in this chapter are set up to illustrate a number of issues in parallel rendering. First of all, regardless of the amount of occlusion in a scene, most objects will be intersected at some stage during rendering. Most objects, typically more than half, are accessed fairly often. This has profound implications for parallel rendering. Assuming that these scenes are to be distributed over a number of processors, different scheduling approaches can be expected to perform in different ways.

Demand driven rendering, for instance, normally relies on the replication of all objects with each processor. In the case that the scene is too large for this, objects will have to be distributed and caching schemes will have to be employed. As figure 3.2 shows, caching would only be partially successful. There is a (small) number of objects in each scene which is intersected by most rays; these would end up in the cache of most processors. However, it is the large number of objects that are still intersected quite often that will cause caches to thrash. It is therefore our opinion that general caching schemes applied to demand driven ray tracing will lead to reduced performance.

On the other hand, data parallel approaches, whereby data is distributed from the start, would cope with an even distribution of accesses very well, since a simple object distribution would lead to an even workload. However, the presence of a small group of objects that is accessed a couple of orders of magnitude more often than the other objects will almost inevitably lead to load imbalances, unless these objects can be identified, for example during a pre-processing step, and replicated with each processor. However, other than predicting that the light sources will be the most important objects in the scene, automatically identifying during pre-processing which objects are accessed most often is difficult and for now remains an unsolved problem.

When sampling of indirect illumination is required, a measurable difference can be observed in coherence between indirect rays and, for example, shadow and primary rays. For parallel rendering this means that data parallel approaches are more appropriate for indirect sampling, while demand driven approaches are more suitable for more coherent tasks such as bundles of primary and shadow rays.

For caching to be most successful, the time between first and last access should be as short as possible. However, the temporal behaviour of object accesses seems to indicate that for most objects in the scene the time between first and last access is substantial. This means that if a caching scheme needs to be employed for these objects, the time such objects need to remain in the cache is long. If the cache is not large enough, then these objects will be repeatedly cleared from the cache and later fetched again.

For different ray types, the temporal behaviour is somewhat different. For example, for primary rays the number of ray-object intersections largely depends on the size of the projection of the object on the image. Small objects therefore tend to be accessed during a small period of time. This suggests that caching would be effective for primary rays. Although shadow rays may originate from any part of the scene, and will do so when diffuse inter-reflection is sampled, coherence between bundles of shadow rays is still preserved. In the case of sampling an area light source, a single intersection point generates a bundle of rays. The number of rays will typically depend on the size of the light source and the distance to the intersection point. The amount of data required to complete sampling such a bundle of rays will depend on these parameters as well, but it is to be expected that such a bundle would require only a small subset of the scene data. The hybrid scheduling algorithm, which is explained in the following chapter, will rely heavily on this form of coherence.

Figure 3.13: Data accesses per second for primary (top) and shadow rays (bottom). These graphs are for the same object and rendering as the top graph in figure 3.7. The primary rays account for 244 intersections with this object and 58311 intersections with shadow rays were recorded.


Figure 3.14: The graph of figure 3.8 (colour cube model including diffuse sampling) split by ray type. From top to bottom: primary rays, shadow and diffuse inter-reflection.

4 Hybrid scheduling

As already explained in chapter 2, under certain circumstances parallel ray tracing is a relatively simple problem which has effectively been solved. A screen space subdivision, where sub-images are handed out on demand by a master processor to a set of identical slaves, leads to near perfect speed-ups, and a large number of processors can be used before communication overheads begin to dominate. As each processor is responsible for the entire ray trees of a subset of pixels, most of the scene geometry will be accessed by each processor during the computation. The most important disadvantage of demand driven ray tracing is therefore that the scene has to be replicated with all processors for optimum performance. For relatively small scenes this is acceptable, and hence demand driven scheduling is the preferred method for such models.

If, however, the scenes to be rendered are large (defined for argument's sake as larger than the memory associated with a single processor can accommodate), then this scheduling method becomes impractical. Under those circumstances, the scene geometry must be distributed in some way, and when a processor requires access to an object that is not locally available, it has to be fetched from another processor. Without special measures, data fetches will start to dominate the computation. One method to make demand driven scheduling for large scenes more practical is to use caching mechanisms. These trade memory for a reduction in data communication, but rely on the assumption that a small portion of the data is accessed very often, while the remainder is accessed only occasionally. In other words, the rendering algorithm should exhibit data locality, which for ray tracing cannot automatically be extracted without modifying the algorithm. Standard caching schemes more often than not do not employ any prior knowledge of the algorithm for which they are employed. Because data accesses in ray tracing are both incoherent and highly variable over time, caching of object data in a demand driven scheduler may not necessarily result in an efficient algorithm.

In order to render large scenes and effectively utilise the memory available with each processor, one could distribute the data and allocate tasks to those processors that hold the relevant scene geometry. This is called data parallel rendering and, although very large scenes could possibly be rendered, they will not be rendered very efficiently, due to the tendency of tasks to concentrate in certain areas of the scene. As these areas tend to be stored with only a small number of processors, the scalability of this method of scheduling is typically limited to a small number of processors as well.

Therefore, it makes sense to investigate possibilities to combine both demand driven and data parallel scheduling into a single hybrid algorithm. This provides the opportunity to keep the efficiency of demand driven rendering with the added advantage of being able to render larger scenes. For this approach to be successful, certain criteria should be observed. First, a substantial amount of the work associated with rendering an image should be located in the demand driven component of our hybrid scheduling algorithm, because only then will it become possible to use these tasks for load balancing purposes. Additionally, the demand driven tasks should involve as much work per task as possible and operate on as little data as possible. This limits the amount of data required per demand driven task, so that data fetches for each of these tasks are relatively straightforward. It also minimises the amount of memory required per demand driven task due to the small amount of object data that needs to be cached (at least for the duration of the task). Second, the data parallel component should only be used for those tasks for which it is difficult to predict which data is required, and for those tasks which require a potentially very large amount of scene data. This part of the hybrid scheduling algorithm should be as light as possible, as it tends to account for most load imbalances and potentially suffers from a poor computation to communication ratio.

The novelty in our approach to finding a hybrid scheduling algorithm that satisfies these criteria is to use coherence as the criterion for splitting off demand driven tasks. The following section introduces the concept of coherence with respect to ray tracing and shows how it can be utilised for parallelisation purposes. The resulting tasks can be classified into demand driven and data parallel components, which are discussed in sections 4.2 and 4.3. Section 4.4 explains how these components interact such that the demand driven part balances the potentially uneven load caused by the data parallel component. This chapter concludes with an example.

4.1 Coherence and data pre-selection

A basic definition of coherence could be the extent to which an image or an environment is locally constant [18]. Applied to ray tracing, the following types of coherence may be distinguished:

Object coherence. Objects consist of separate connected pieces bounded in space, and distinct objects are disjoint in space. This is the main form of coherence; all others are derived from it [65]. One of the best known techniques that exploits object coherence directly is spatial subdivision (see chapter 1), which effectively sorts space such that the closest intersection is found without querying the entire scene database.

Ray coherence. Rays which start at the same point and travel in similar directions are likely to intersect the same objects. An example of ray coherence is given in figure 2.4, where most of the plants do not intersect the viewing frustum. Only a small percentage of the plants in this scene are needed to intersect all of the primary rays drawn into it.

Image coherence. When a model exhibiting coherence is projected onto a screen, the resulting image should show constancy as well. This was effectively exploited in [76].


Frame coherence. In animations, coherence between successive frames could be exploited. In most animations the appearance of successive frames tends to be similar, so that only small parts of the scene need to be re-rendered. An example of how frame coherence could be utilised is presented in [10].

Data coherence. Rendering algorithms tend to access data in a somewhat predictable way [18]. According to Green and Paddon, not only the data items themselves, but also the order in which they are accessed, are predictable (or can be made predictable). Unfortunately, we do not share this experience, otherwise we would immediately revert to a demand driven approach using caching techniques (as discussed in the introduction to this chapter). A good example of a sequential memory-coherent ray tracing algorithm is presented in [43]. In this algorithm, the data is not matched to the task at hand, but the tasks are selected according to the scene data that is temporarily loaded from file. This method of computation re-ordering does require, however, a substantial amount of off-line storage for intermediary results, making it a less suitable algorithm for our purposes.

As the aim is to split the computation into a demand driven part and a data driven part, ray coherence appears to be a valuable tool to explore. In ray tracing there exist many rays that start from the same point and travel in similar directions. The most obvious class of rays are primary rays. These all start from the view point and travel into the scene until an intersection is found, or the ray leaves the scene (figure 2.4). In addition to primary rays, once an intersection point is found, normally most or all of the light sources are sampled. As it is nowadays reasonable to assume that light sources have a surface area which is larger than zero, a bundle of rays is required to sample each source. As all rays that sample a light source start from the same intersection point and travel towards a light source with finite surface area, they form bundles which exhibit coherence. Because these classes of rays are coherent, the rays within each bundle tend to traverse the same parts of object space and intersect the same objects. The narrower these bundles of rays, the more coherent they are and therefore the fewer objects are intersected by the bundle.

Other classes of rays typically sample specular reflection and refraction and diffuse reflection and refraction. Specular reflection is usually sampled with one, or at most a couple of, rays (for glossy reflection) and their direction depends on the surface normal and the angle at which the parent ray hits a surface. Although only a small amount of data will be required to complete such a ray task, it is difficult to bundle a number of specular rays together and still have them intersect a small portion of the scene. Diffuse inter-reflection, on the other hand, involves a large number of rays that travel from the same intersection point, sampling a complete hemisphere. The bundle of diffuse inter-reflection rays is therefore too divergent, and the total amount of data required for the whole set of rays is considered to be large.

Ray type                   Coherence   Data predictability   Amount of data needed per task
Primary                    Yes         Good                  Small
Shadow                     Yes         Good                  Small
Specular reflection        No          Very good             Small
Diffuse inter-reflection   No          Bad                   Large

Table 4.1: Ray types and their characteristics.

For those types of ray which exhibit coherence, a good algorithm is required that can determine which data is going to be needed to complete tracing a bundle of rays. In general it is straightforward to predict which data may be needed for single rays; for example, data predictability for specular reflection rays can be achieved by traversing the spatial subdivision. Predicting which data is required for bundles of rays is slightly more difficult, but still feasible. An efficient data pre-selection algorithm for bundles, called pyramid clipping, is presented in chapter 5. Table 4.1 summarises the different types of ray, their potential coherence, data predictability and the amount of data that is expected to be needed.

Table 4.1 indicates that both bundles of primary rays and bundles of shadow rays are good candidates for demand driven tasks. All other ray types are less suitable for demand driven execution. Because more primary and shadow rays are traced during image generation than all other types together, these tasks can indeed be used to effectively balance the workload. Our hybrid scheduling algorithm therefore exploits coherence by splitting off bundles of primary rays and shadow rays as demand driven tasks. The remaining rays are considered to be incoherent and are executed in data parallel fashion.
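In code, this scheduling decision reduces to a classification on ray type, roughly as sketched below. The enum and function names are placeholders for illustration, not the interface used in the actual implementation.

/* Hypothetical sketch of the task classification suggested by table 4.1. */
typedef enum { RAY_PRIMARY, RAY_SHADOW, RAY_SPECULAR, RAY_DIFFUSE } RayType;
typedef enum { SCHEDULE_DEMAND_DRIVEN, SCHEDULE_DATA_PARALLEL } Schedule;

static Schedule classify_ray(RayType type)
{
    switch (type) {
    case RAY_PRIMARY:   /* coherent bundle over a sub-image           */
    case RAY_SHADOW:    /* coherent bundle towards one light source   */
        return SCHEDULE_DEMAND_DRIVEN;
    default:            /* specular and diffuse inter-reflection rays */
        return SCHEDULE_DATA_PARALLEL;
    }
}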

4.2 Data parallel component

As it is assumed that the scene to be rendered is large, the data will have to be distributed over the available processors. A ray that is to be executed in data parallel mode starts at the processor that holds the part of object space containing the ray's origin. The objects local to that processor are intersected with the ray until either an intersection is found or the ray leaves the processor's subspace. Obviously, this local ray-object intersection process should be accelerated using a spatial subdivision, just as in sequential ray tracing; not all previous work reported in the literature mentions this. When a ray leaves a processor's local subspace, it is migrated to the processor that holds the next subspace. To minimise communication overheads, this should ideally be a neighbouring processor in the network, but this does not always have to be the case¹. Whenever a processor finds an intersection, secondary rays are spawned which by necessity start at the same processor.

Shading is normally performed by the processor that detects an intersection. As secondary rays may have to be traced by other processors, the intersection point must be stored temporarily until all shading results of secondary rays have been returned. Only then is shading performed and the result returned to the processor that spawned the ray. This method avoids communication of shading results. Alternatively, shading could be transferred to the processor which permanently stores the intersected object. This approach is preferred when shading involves a large amount of data, which should be distributed over the processors. This may, for example, happen when large texture maps need to be sampled. If an ambient cache is to be built on the fly, shading should also be performed by the processor which stores the intersected object. This ensures that the cached values for the indirect computation are consistent and efficiently used.

The fact that intersection points have to be stored for some time causes an extra memory overhead with each processor. As data parallel rendering tends to cause processing bottlenecks with a small number of processors, these same processors will also suffer from a memory bottleneck because of this. In order to avoid such memory bottlenecks, the number of rays being traced simultaneously should normally be restricted. Processing bottlenecks in data parallel ray tracing have been reported in the literature, but to our knowledge memory bottlenecks due to hot-spots have not been discussed so far. As in our hybrid algorithm only a relatively small number of rays are traced in data parallel mode, the impact of both processing and memory bottlenecks is expected to be reduced naturally. However, in the presence of diffuse inter-reflection sampling, the overheads within the data parallel part will gain in significance. Memory issues are discussed in detail in chapter 7, while the processing issues associated with the data parallel component are explored in chapter 8.

¹ Especially when clusters of workstations are used to render an image, the interconnection topology is an unknown quantity. In this thesis no assumptions are made as to what interconnection network is available. This makes the whole discussion independent of the parallel machine the algorithm is executed on.
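The following sketch outlines one possible shape of this data parallel step: a local intersection attempt, migration of the ray to the owner of the next subspace, and a pending-intersection record that is only shaded once all secondary results have arrived. All types and helper functions are hypothetical placeholders; the actual implementation is discussed in chapters 7 and 8.

#include <stdlib.h>

typedef struct Ray Ray;

typedef struct {
    Ray *ray;       /* ray that produced this intersection              */
    int  pending;   /* shading results of secondary rays still expected */
    /* ... hit point, surface normal, material, partial colour ...      */
} PendingHit;

extern int  local_intersect(Ray *ray, PendingHit *hit); /* non-zero on a local hit */
extern int  next_owner(const Ray *ray);   /* processor of next subspace, -1 = none */
extern void send_ray(int processor, Ray *ray);          /* migrate the ray task    */
extern int  spawn_secondary_rays(PendingHit *hit);      /* returns number spawned  */
extern void store_pending(PendingHit *hit);             /* shade when pending == 0 */
extern void return_background(Ray *ray);                /* ray left the scene      */

static void trace_data_parallel(Ray *ray)
{
    PendingHit *hit = malloc(sizeof *hit);
    hit->ray = ray;
    hit->pending = 0;

    if (local_intersect(ray, hit)) {
        /* Keep the intersection point until all secondary shading results
         * have returned; only then is this point shaded.                  */
        hit->pending = spawn_secondary_rays(hit);
        store_pending(hit);
    } else {
        free(hit);
        int owner = next_owner(ray);
        if (owner >= 0)
            send_ray(owner, ray);     /* continue on the next subspace's owner */
        else
            return_background(ray);   /* no more voxels along the ray          */
    }
}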

4.3 Demand driven component

As argued above, the demand driven component will contain tasks consisting of bundles of primary and shadow rays. As ray trees are always terminated with shadow rays (these do not spawn new secondary rays), the number of rays traced as demand driven tasks is always higher than the number of data parallel rays. Hence, hybrid scheduling algorithms partitioned in this way stand a chance of being efficient.

When a processor requests a new task from the master processor, a random sub-image is selected and returned to this processor. This helps to distribute the workload more evenly, because some regions of the image tend to involve fewer computations than other regions. The size of the sub-image regulates the amount of work inserted into the system per task. For a shadow ray task, the shadow rays all originate from the same intersection point and travel towards the same light source. The ratio of the distance between intersection point and light source to the size of the light source determines how many rays are shot and how divergent the bundle of rays is. We assume that on average the bundle contains enough rays to warrant data fetches, yet is narrow enough not to require too many of them. This assumption is tested in chapter 7.

Before a demand driven task can commence, it should be determined whether all the data required for the task is present or not. An efficient data selection mechanism is required that selects all the data that could possibly be needed to complete the task (without simply selecting the entire scene, of course). Such an algorithm, called pyramid clipping, is discussed in detail in chapter 5. This algorithm creates a bounding volume around the bundle of rays and intersects this volume with a spatial subdivision. The result is an ordered list of voxels, all of which need to be present to complete the demand driven task. Whenever a voxel is not locally available, it is requested from the processor which is known to store the relevant data. When the data is received, it is cached locally using a least recently used strategy. In order to let other tasks benefit from recently requested data in a transparent manner, it was deemed best to insert cached data into the spatial subdivision structure. In this way, both other demand driven tasks and all data parallel tasks which happen to traverse these cached voxels can access the data without having to access a separate pool of memory (which is the usual way of implementing a cache). The data structure necessary to maintain cached data is discussed in section 7.2.
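A sketch of this pre-fetch step might look as follows, where the voxel list is assumed to have been produced by pyramid clipping; all names are illustrative placeholders rather than the actual interfaces.

/* Illustrative placeholders only; not the actual task or voxel records. */
typedef struct {
    int id;         /* voxel identifier within the octree          */
    int owner;      /* processor that permanently stores the voxel */
    int resident;   /* non-zero if local or already cached         */
} Voxel;

typedef struct {
    Voxel **voxels;      /* ordered list produced by pyramid clipping */
    int     nvoxels;
    int     outstanding; /* data requests not yet satisfied           */
} DemandTask;

extern int  local_processor(void);
extern void request_voxel(int owner, int voxel_id, DemandTask *task);

/* Request every voxel that is neither local nor cached; the task may only
 * execute once 'outstanding' has been counted back down to zero by the
 * incoming data messages. */
static void prefetch_task_data(DemandTask *task)
{
    task->outstanding = 0;
    for (int i = 0; i < task->nvoxels; i++) {
        Voxel *v = task->voxels[i];
        if (!v->resident && v->owner != local_processor()) {
            request_voxel(v->owner, v->id, task);
            task->outstanding++;
        }
    }
}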

4.4 Priority task selection

Each processor should be capable of handling a variety of different tasks. The most obvious of these are the demand driven and data parallel tasks which it may get offered. Whenever a processor receives a message, it decodes which tasks are contained within and executes those tasks. If there are no more messages, a request for more work is issued and sent to the master processor.

Different tasks have been given different priorities, such that the most important ones are selected first by each processor and less important ones are not executed immediately. In order to make the hybrid scheduling algorithm work, some of these relative priorities are important and fixed. Demand driven tasks always have a lower priority than data parallel tasks. This means that if a demand driven task and a data parallel task arrive at a processor at the same time, the data parallel task will be executed first. This ensures that demand driven tasks are only executed when no other work is available. Hence demand driven tasks are used to balance the workload between processors. This is the key to hybrid scheduling.

Other tasks are requests for more work, which are given a very high priority so that processors remain idle for as short a period as possible. The same is true for requests for data. When a processor requests some data, it does this because there are no data parallel tasks left with that processor and a demand driven task needs to be executed (for which not all data was locally available). Satisfying data requests quickly therefore reduces idle time. Hence requests for data are sent with a high priority, and the returned data has high priority too. In summary, tasks received from other processors are executed with different priorities. The relative priorities are listed in table 4.2.

Priority    Task type
Highest     Request for data
            Satisfying request for data
High        Request for work
Medium      Data parallel task
Low         Demand driven task

Table 4.2: Relative priorities for tasks.

Although no assumptions are made regarding the interconnection topology of the network, it is assumed that the network is relatively slow. It is fair to assume that global communication is characterised by a long start-up latency and a fair throughput. This means that it is advantageous to send fewer, larger messages rather than more, smaller ones. For this reason, tasks are bundled into larger messages. A message can therefore contain many tasks, albeit that all these tasks belong to the same priority level. Whenever a task is generated, it is not sent immediately, but buffered in a message queue instead. If the message is large enough, it is sent. The maximum size of a message also depends on the priority level: more important messages are sent sooner than less important ones. When a processor runs out of work, in addition to requesting more work from the master processor, it also sends its buffered messages that have not been sent yet. This mechanism ensures that no pair of processors can hold messages for each other while also waiting for each other's messages; hence it is a means to avoid deadlock.
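The message buffering described above could be organised roughly as in the following sketch, with one buffer per destination and priority level and a smaller flush threshold for more important messages. Buffer sizes, thresholds and function names are all assumptions made for illustration, not the values or interfaces of the actual implementation.

#include <string.h>

enum { PRIO_HIGHEST, PRIO_HIGH, PRIO_MEDIUM, PRIO_LOW, NUM_PRIO };

#define MAX_PROCESSORS 64
#define BUFFER_BYTES   8192

/* A smaller threshold means a message of that priority is sent sooner. */
static const int flush_threshold[NUM_PRIO] = { 1, 256, 2048, 4096 };

typedef struct { char data[BUFFER_BYTES]; int used; } MessageBuffer;
static MessageBuffer outbox[MAX_PROCESSORS][NUM_PRIO];

extern void send_message(int dest, const void *data, int bytes, int priority);

static void flush(int dest, int prio)
{
    MessageBuffer *b = &outbox[dest][prio];
    if (b->used > 0) {
        send_message(dest, b->data, b->used, prio);
        b->used = 0;
    }
}

/* Append a task to the buffer for its destination and priority; send the
 * buffer once it exceeds the priority-dependent threshold. */
static void queue_task(int dest, int prio, const void *task, int bytes)
{
    MessageBuffer *b = &outbox[dest][prio];
    if (b->used + bytes > BUFFER_BYTES)
        flush(dest, prio);                      /* make room first */
    memcpy(b->data + b->used, task, (size_t)bytes);
    b->used += bytes;
    if (b->used >= flush_threshold[prio])
        flush(dest, prio);
}

/* Called when a processor runs out of work: flushing everything ensures no
 * two processors can each hold messages the other one is waiting for. */
static void flush_all(int nproc)
{
    for (int d = 0; d < nproc; d++)
        for (int p = 0; p < NUM_PRIO; p++)
            flush(d, p);
}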

4.5 Data structures

As the algorithm is based on an existing ray tracer (Radiance [75]), its data structure, as detailed in appendix B, is modified to allow the objects to be distributed over the processors. The octree is replicated with each processor, as are the object sets. However, the objects themselves are distributed. In order to determine which processor holds data that may be needed for a demand driven task, and to send data parallel tasks to the relevant processors, the object sets are modified to include fields that identify which processor holds the associated objects, as well as flags to indicate whether an object set is cached or not. The resulting spatial subdivision structure is depicted in figure 4.1.

The extra flag in the object set stores the processor number which holds the objects for this set. This number is set during start-up and is not altered during the computations. The remaining bits in this flag are used to indicate whether this object set is cached or not. When an object set is cached, it is included in a doubly linked list. Adding is done at one end of the list, while freeing cached voxels/object sets is performed by examining the end of the list and removing object sets. Although the cache works on voxel bounds, it is the objects themselves that are actually cached. The values in the object sets, apart from the list pointers and the flag field, are never changed. Because objects can additionally span several voxels, a mechanism is needed to record for each object by how many voxels it is cached (if at all). In order to accomplish this, each object has an extra field which stores a semaphore. Whenever an object is cached for an object set, its value is increased.


Figure 4.1: Radiance’s data structure [75], including modifications required for parallel rendering.

Demand driven tasks require a separate data structure for several reasons. The execution of a demand driven task occurs in two distinct stages: fetching data and executing the task using the fetched data. In between fetching data and executing the task, other tasks may be executed. Therefore, while data is being fetched for a task, the task is stored in a linked list. New demand driven tasks are appended to the list, so that execution of demand driven tasks is in order of creation. This helps to minimise the number of data fetches, because if an object is required for more than one task, maintaining this order of execution ensures that the object does not need to be fetched again for subsequent tasks. Each task in the task list has a counter that indicates the number of objects requested for this task. Whenever an object is received, the task for which it was fetched is retrieved and its counter is decreased. If the counter reaches zero and the task is at the head of the list, the task may be executed.

Finally, to ensure that object sets are not freed before they have been used, if during data selection an object set needs to be cached, its task number field is set to the latest task for which it is required (each task is given a unique number). This is depicted in figure 4.1, where the object set is not cached until its objects were requested for demand driven task number 3. As object 5 was already requested for task number 1, this object is not requested again; however, its semaphore is incremented. On the other hand, object number 6 was not yet fetched and will now be fetched for task number 3. This object now also points to this task. Finally, the object set is updated by modifying the flag field to indicate that it is in the process of being cached. It is also added to the doubly linked list of object sets using the previous and next pointers, and the task number is set to task 3.

Whenever too much memory is occupied by cached objects, the doubly linked list of object sets is traversed, starting with the oldest entries. An object set can only be released if the task number for which it was last cached is lower than the currently executed demand driven task. When this occurs, the flag field is updated to reflect the set's new status. The semaphores of the objects belonging to this set are decremented, and each object whose semaphore reaches zero is freed.
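The description above can be summarised in a few data structures, sketched below. The field and function names follow the text rather than the actual Radiance source, and unlinking the set from the cache list is omitted for brevity.

/* Sketch only; not the actual Radiance data structure. */
typedef struct Object {
    int refcount;            /* number of cached object sets using this object */
    /* ... geometry and material data ...                                      */
} Object;

typedef struct ObjectSet {
    int      nobjects;
    Object **objects;
    unsigned owner  : 16;    /* processor that permanently stores the objects  */
    unsigned cached :  1;    /* set while the objects are held locally         */
    long     task;           /* last demand driven task that needed this set   */
    struct ObjectSet *next, *prev;   /* doubly linked list of cached sets      */
} ObjectSet;

extern void free_object(Object *obj);

/* Walk the cache list starting at the oldest entry and release sets whose
 * last requesting task has already been executed. */
static void release_cached_sets(ObjectSet *oldest, long current_task)
{
    for (ObjectSet *s = oldest; s != NULL; s = s->next) {
        if (!s->cached || s->task >= current_task)
            continue;                 /* still needed by a pending task */
        s->cached = 0;
        for (int i = 0; i < s->nobjects; i++)
            if (--s->objects[i]->refcount == 0)
                free_object(s->objects[i]);   /* no cached set uses it any more */
    }
}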



Figure 4.2: Example of a data distribution for four processors based on an octree.

4.6 Example

As the various components of this hybrid scheduling algorithm interact with each other in a non-trivial way, this chapter concludes with an example showing which tasks exist and how they are scheduled and executed by different processors.

First of all, the data structure consists of an octree, which is replicated with each processor. An initial data distribution is created based on this octree; an example of a possible data distribution is given in figure 4.2. Whenever a slave is without work, which is detected by an empty input queue, it requests more work from a master processor. If work is available, the master processor will send a new primary ray task to this slave. Assume that processor 3 has requested some work and has received a primary ray task. It will then select the appropriate voxels needed to render these rays and request missing data from the other processors. When all the data is received, the primary ray task may be executed, see figure 4.3. Let us assume that these primary rays intersect cached objects that were originally stored with processor 4. Now, processor 3 generates secondary rays, whereby the secondary shadow rays are demand driven tasks and the reflection ray is a data parallel task. Before executing any secondary rays, the intersection point is stored with processor 3. As discussed before, an alternative strategy is possible whereby the intersection point is transferred to the processor that holds the intersected data permanently, i.e. processor 4. Moving the found intersection point to this processor would be advantageous in case a substantial amount of data is required to perform shading, e.g. texture maps, BRDF data, bump maps, radiance caches etc. Communication of such large data structures would not make sense.

Figure 4.3: Primary ray task requested by processor 3.

There are two kinds of secondary rays: shadow ray tasks and reflection/refraction rays. Shadow ray tasks can be freely scheduled with any processor. One possible way to accomplish this is to send the shadow ray task to the master processor to be re-scheduled; other scheduling techniques are possible (and are discussed later in this thesis). Suppose the shadow ray task is rescheduled with processor 2 (see figure 4.4). This processor will then fetch the data needed to render these shadow rays. Once finished with the shadow ray task, the shading result of all these shadow rays is sent back to the processor that found the intersection point from which these shadow rays started. This intersection point is located with processor 3.

The reflection ray is handled in data parallel mode, because the amount of work involved for a single ray does not warrant any data fetches. This ray starts off at the processor which found the intersection point (3). Because the data that is first traversed is permanently stored with processor 4, but is still available with processor 3 in cached form, this ray is first processed by processor 3, using cached data. If this ray hits an uncached voxel, the ray task is migrated to processor 1, because this processor holds the next data on the ray's path. Assume that processor 1 finds an intersection for this ray. Now, processor 1 continues the ray tracing algorithm recursively. A new intersection point is stored with processor 1. Shadow rays are spawned, which are offered to the master processor for re-scheduling. A new reflection ray is spawned as well, which in turn is executed as a data parallel ray task. Once all these rays have returned shading results, processor 1 shades this intersection point and returns the result for the reflection ray to processor 3. This example is depicted in figure 4.5.


Figure 4.4: Shadow task executed by processor 2 (which may have occurred because this processor was temporarily out of work). Once completed, the shading result is returned to processor 3.


Figure 4.5: Reflection ray, executed as a data parallel task, starts at processor 3 using cached data which was earlier fetched from processor 4. This ray is then transferred to processor 1, where a shading result is computed and returned to processor 3.

5 Data selection

Whenever a demand driven task consisting of a coherent bundle of rays is generated, the data required to complete the task needs to be fetched before the task can be executed. Typically, part of this data is already present locally, either in cached form or as part of the local distribution; other data will have to be fetched from remote processors. As the data is sorted over space with a spatial subdivision structure, in this case an octree, and the data distribution over the processors is based on this octree as well, the spatial subdivision is used to establish which data needs to be fetched from other processors.

A number of different algorithms, such as shaft culling [22], can be used for data selection purposes. Greene published an algorithm for intersecting arbitrary convex polyhedra with rectangular solids [21]. This algorithm can be used to efficiently cull polygons that are inside a viewing pyramid from an octree spatial subdivision. In order to be able to select the required data, an algorithm called pyramid clipping is used [70, 78], which differs from Greene's in that ours uses a Cohen-Sutherland clipping test to determine whether a voxel intersects a pyramid.

The pyramid clipping algorithm selects the voxels that lie in the path of the bundle of rays that forms a demand driven task. In subsequent steps, these voxels can be checked to see whether they are cached or not, and if a voxel does not belong to the local processor, the remote processor which stores this voxel is determined and a request for data is issued. Once all the data for the task is locally available, the task is executed.

The pyramid clipping algorithm proceeds in a number of steps, each of which is detailed in this chapter. First of all, a bounding volume is constructed around the bundle of rays (section 5.1). Then, this pyramid is intersected with the octree (section 5.2). This results in a sorted list of voxels which will be required to complete the task and which determines which data needs to be fetched. However, it is also possible to use this list to speed up ray traversal; this is discussed in section 5.3. Finally, this chapter provides evidence for the efficiency of this algorithm in section 5.4 and gives conclusions in section 5.5.

5.1 Bounding pyramid

The pyramid clipping algorithm should be able to determine the voxels intersected by a bundle of rays, for both primary ray and shadow ray tasks. As the information available for the two ray types is slightly different, the construction of a bounding pyramid proceeds differently for each. However, once the pyramid is constructed, the pyramid-octree intersection algorithm is the same for both types of task.

For primary rays, the pyramid is constructed with its apex located at the view point and its four planes intersecting the sides of the sub-image for which this task is created.


What is required, therefore, is four plane equations with the plane normals pointing inwards. These plane normals are derived from four direction vectors $\vec{D}_{0,0}$, $\vec{D}_{0,1}$, $\vec{D}_{1,0}$ and $\vec{D}_{1,1}$, each pointing from the view point to one of the four corner pixels of the sub-image, see figure 5.1:

$\vec{N}_0 = \vec{D}_{0,0} \times \vec{D}_{0,1}$
$\vec{N}_1 = \vec{D}_{0,1} \times \vec{D}_{1,1}$
$\vec{N}_2 = \vec{D}_{1,1} \times \vec{D}_{1,0}$
$\vec{N}_3 = \vec{D}_{1,0} \times \vec{D}_{0,0}$

As the view point lies within all four planes, the plane equations are derived from the plane normals by applying the view point to the plane normals.

Figure 5.1: The pyramid's plane normals are derived from four direction vectors by taking cross products.
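A minimal sketch of this construction for primary ray tasks is given below. The vector type, helper functions and the assumption that the corner directions are supplied in cyclic order are illustrative only.

```cpp
// Sketch: bounding pyramid for a primary ray task (figure 5.1).
struct Vec3  { float x, y, z; };
struct Plane { Vec3 n; float d; };                     // a point p is inside when dot(n,p) + d >= 0

static Vec3  cross(Vec3 a, Vec3 b) { return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x}; }
static float dot  (Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }

// D[0]..D[3]: directions from the view point to the four corner pixels of the
// sub-image, given in cyclic order so that consecutive cross products point inwards.
void primary_pyramid(Vec3 viewpoint, const Vec3 D[4], Plane planes[4])
{
    for (int i = 0; i < 4; ++i) {
        planes[i].n = cross(D[i], D[(i + 1) % 4]);     // normal from two adjacent corner directions
        planes[i].d = -dot(planes[i].n, viewpoint);    // the view point lies in every plane
    }
}
```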

When sampling a light source from an intersection point, the exact directions for the rays are not known. However, the position and the size of the light source are known, and based on this information a pyramid may be constructed. This is achieved by creating a bounding sphere around the light source. With the intersection point as apex, a bounding pyramid is constructed around this sphere, see figure 5.2. Given the vectors as drawn in this figure, with $\vec{D}$ and $\vec{R}$ known, first a vector $\vec{P}_0$ perpendicular to $\vec{D}$ is constructed by taking the cross product of $\vec{D}$ and an arbitrary non-null vector which does not coincide with $\vec{D}$. The plane normal $\vec{N}_0$ for one of the planes is now given by:

$\vec{N}_0 = \dfrac{\vec{D}}{\|\vec{D}\|}\,\|\vec{R}\| + \dfrac{\vec{P}_0}{\|\vec{P}_0\|}\,\|\vec{D}\|$   (5.1)

The second plane normal is then created similarly, by first creating a new vector $\vec{P}_1$ that is perpendicular to $\vec{P}_0$ and $\vec{D}$ by taking their cross product and then applying equation 5.1. The third and fourth plane normals are computed by applying equation 5.1 to $\vec{P}_2 = -\vec{P}_0$ and $\vec{P}_3 = -\vec{P}_1$ respectively. Finally, the plane equations are computed by applying the intersection point to the plane normals.
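The corresponding construction for a light source pyramid might look as follows. The choice of the arbitrary helper vector and all names are assumptions made for illustration; only the use of equation 5.1 follows the text above.

```cpp
// Sketch: bounding pyramid around a light source (figure 5.2, equation 5.1).
#include <cmath>

struct Vec3  { float x, y, z; };
struct Plane { Vec3 n; float d; };

static Vec3  cross(Vec3 a, Vec3 b) { return {a.y*b.z - a.z*b.y, a.z*b.x - a.x*b.z, a.x*b.y - a.y*b.x}; }
static float dot  (Vec3 a, Vec3 b) { return a.x*b.x + a.y*b.y + a.z*b.z; }
static float len  (Vec3 v)         { return std::sqrt(dot(v, v)); }
static Vec3  mul  (Vec3 v, float s){ return {v.x*s, v.y*s, v.z*s}; }
static Vec3  add  (Vec3 a, Vec3 b) { return {a.x+b.x, a.y+b.y, a.z+b.z}; }
static Vec3  sub  (Vec3 a, Vec3 b) { return {a.x-b.x, a.y-b.y, a.z-b.z}; }

// apex: the intersection point; centre/radius: bounding sphere of the light source.
void light_pyramid(Vec3 apex, Vec3 centre, float radius, Plane planes[4])
{
    Vec3 D = sub(centre, apex);                                  // apex towards the sphere centre
    // Any non-null vector that does not coincide with D will do as a helper.
    Vec3 helper = (std::fabs(D.x) < std::fabs(D.z)) ? Vec3{1, 0, 0} : Vec3{0, 0, 1};

    Vec3 P[4];
    P[0] = cross(D, helper);                                     // perpendicular to D
    P[1] = cross(D, P[0]);                                       // perpendicular to D and P[0]
    P[2] = mul(P[0], -1.0f);
    P[3] = mul(P[1], -1.0f);

    for (int i = 0; i < 4; ++i) {
        // Equation 5.1: N = (D/|D|)|R| + (P/|P|)|D|
        Vec3 n = add(mul(D, radius / len(D)), mul(P[i], len(D) / len(P[i])));
        planes[i] = { n, -dot(n, apex) };                        // apply the intersection point
    }
}
```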

It should be noted that because a bounding sphere is used as the basis for constructing a pyramid around a light source, unless the light source is a sphere, the pyramid will on average be wider than strictly necessary. This is exacerbated by the particular implementation used for this work, whereby a bounding box is placed around the light source and, based on the longest edge of this box, a sphere is placed around the bounding box. This slightly peculiar implementation was chosen for simplicity, and it is certainly possible to optimise this aspect of the algorithm by constructing a tighter bounding sphere around the light source, which would reduce the number of objects within the pyramid. One way to achieve this is mentioned in the following section.



Figure 5.2: For light sources, a bounding sphere is computed first. Then the plane equations are derived from the radius of the sphere and the distance between the intersection point and the centre of the sphere.


Figure 5.3: The planes defining the pyramid generate nine subspaces.

5.2 Pyramid-octree intersection

Once a pyramid has been constructed, it is intersected with the spatial subdivision. Central to this step is the classification of each voxel by its position with respect to the pyramid. This classification uses a variant of the Cohen-Sutherland line clipping algorithm. The octree is recursively tested for intersection with the pyramid. Each level of recursion requires at most two tests to determine whether an intersection occurs between a voxel and the pyramid.

The first test examines the position of the vertices of a cell with respect to the planes of the pyramid. The four planes of the pyramid define nine subspaces which may contain one or more of the vertices of the cell, see figure 5.3. The position of the vertices is derived from the distances of the vertices to the planes of the pyramid. The following cases may now occur (a sketch of this classification test is given after the list):

- All distances are positive, indicating that the vertices are inside sector P (see figure 5.3). This means that the cell is completely contained within the pyramid.
- Some vertices are inside sector P. The cell will be partially inside the pyramid.
- The vertices are in three consecutive subspaces, excluding P (for example A-B-C or C-E-H). Now the cell will be completely outside the pyramid.
- The vertices are in subspaces located on two sides of the pyramid, for example in sectors B and E, or D and H. A second test is required to determine whether an edge intersects the pyramid.
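One possible way to implement this first test is to encode the position of each vertex as a four-bit outcode, one bit per pyramid plane, in the style of Cohen-Sutherland. The sketch below is an illustration under that assumption; in particular, the "three consecutive subspaces" case corresponds to all vertices lying outside one common plane.

```cpp
// Sketch: classification of a cell against the four pyramid planes.
enum Classification { CELL_INSIDE, CELL_PARTIAL, CELL_OUTSIDE, CELL_NEEDS_EDGE_TEST };

struct Vec3  { float x, y, z; };
struct Plane { Vec3 n; float d; };                               // inside: dot(n,p) + d >= 0

static float signed_distance(const Plane& pl, const Vec3& p) {
    return pl.n.x * p.x + pl.n.y * p.y + pl.n.z * p.z + pl.d;
}

Classification classify_cell(const Vec3 verts[8], const Plane planes[4])
{
    unsigned all_and = 0xFu;                                     // bits common to all vertices
    unsigned all_or  = 0u;                                       // bits set by any vertex
    bool any_inside  = false;

    for (int v = 0; v < 8; ++v) {
        unsigned code = 0u;                                      // bit i set: vertex outside plane i
        for (int i = 0; i < 4; ++i)
            if (signed_distance(planes[i], verts[v]) < 0.0f)
                code |= 1u << i;
        all_and &= code;
        all_or  |= code;
        if (code == 0u) any_inside = true;                       // this vertex lies in sector P
    }

    if (all_or == 0u)  return CELL_INSIDE;        // all vertices in sector P
    if (all_and != 0u) return CELL_OUTSIDE;       // all vertices outside one common plane
    if (any_inside)    return CELL_PARTIAL;       // some vertices in sector P
    return CELL_NEEDS_EDGE_TEST;                  // vertices on two sides: run the second test
}
```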



Angle