# Extended Data Collection: Analysis of Cache Behavior and Performance of Different BVH Memory Layouts for Tracing Incoherent Rays

Technical Report 13rp003-GRIS

A. Schulz<sup>1</sup>, D. Wodniok<sup>2</sup>, S. Widmer<sup>2</sup> and M. Goesele<sup>2</sup>

<sup>1</sup>TU Darmstadt, Germany <sup>2</sup>Graduate School Computational Engineering, TU Darmstadt, Germany

# 1 Introduction

This technical report complements the paper "Analysis of Cache Behavior and Performance of Different BVH Memory Layouts for Tracing Incoherent Rays" by Wodniok et al. published in the proceedings of the "Eurographics Symposium on Parallel Graphics and Visualization" [WSWG13]. Please see this main paper for details. The purpose of this report is to publish the complete data collection the paper is based on using the NVIDIA Kepler architecture plus additional data collected on the NVIDIA Fermi architecture.

# 2 Metrics

Several metrics are computed from CUPTI event counters. Some of them can be found in the CUPTI User's Guide [NVIb]; others were deduced from values reported by Parallel Nsight and reconstructed from CUPTI events. The formulas in this section use the event names as defined by CUPTI (refer to [NVIb] for more information). A short explanation of each metric follows:

#### Runtime

Trace kernel runtime in milliseconds, measured using CUPTI's activity API.

#### L1 global load hit rate

Percentage of global memory loads that hit in L1. Higher is better.

 $\frac{l1\_global\_load\_hit}{l1\_global\_load\_hit + l1\_global\_load\_miss} \times 100$ 

#### L1 global load size

Amount of data transferred by global memory loads. Lower is better.

 $(l1\_global\_load\_hit + l1\_global\_load\_miss) \times 128 \text{ bytes}$ 
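As a rough sketch, the two L1 metrics above can be computed from raw counter values as follows. The function name and the sample counter values are hypothetical; only the event names and the 128-byte transaction size are taken from the formulas above.

```python
L1_CACHE_LINE_BYTES = 128  # transaction size used in the load size formula


def l1_global_load_metrics(l1_global_load_hit, l1_global_load_miss):
    """Return (hit rate in percent, load size in bytes) for L1 global loads."""
    transactions = l1_global_load_hit + l1_global_load_miss
    if transactions == 0:
        return 0.0, 0  # no global loads were recorded
    hit_rate = 100.0 * l1_global_load_hit / transactions
    load_size = transactions * L1_CACHE_LINE_BYTES
    return hit_rate, load_size


# Made-up counter values, for illustration only.
hit_rate, load_size = l1_global_load_metrics(l1_global_load_hit=750,
                                             l1_global_load_miss=250)
print(hit_rate, load_size)  # 75.0 128000
```

The same pattern applies to the L2 and texture cache metrics below, with 32-byte sectors instead of 128-byte cache lines.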

#### L1 global load transactions per request

Number of cache lines read per global memory load request of a warp. Range is [1,32]. If a whole warp executes a 16-byte load it always results in at least 4 transactions because the warp is requesting 512 bytes but the hardware can only load up to 128 bytes in a single transaction. When all threads in a warp access a single 16-byte value, the value is broadcast to up to 8 threads at once resulting in 4 transactions. The number of transactions can also be lower than 4 in the aforementioned cases if a warp contains fewer active threads. Lower is better.

 $\frac{l1\_global\_load\_hit + l1\_global\_load\_miss}{gld\_request}$ 
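The lower bounds mentioned above can be reproduced with a few lines of arithmetic. This is only a sketch: the function names are hypothetical, and the first function assumes the best case in which the addresses of all active threads are contiguous and aligned.

```python
import math

TRANSACTION_BYTES = 128  # largest amount of data loaded by one transaction
BROADCAST_WIDTH_16B = 8  # threads served at once when broadcasting 16 bytes


def min_transactions(active_threads, bytes_per_thread):
    """Lower bound on transactions for one warp-wide load request."""
    requested = active_threads * bytes_per_thread
    return max(1, math.ceil(requested / TRANSACTION_BYTES))


def broadcast_transactions_16b(active_threads):
    """Transactions needed when all active threads read the same 16-byte value."""
    return max(1, math.ceil(active_threads / BROADCAST_WIDTH_16B))


print(min_transactions(32, 16))        # 4: a full warp requests 512 bytes
print(min_transactions(8, 16))         # 1: only 8 active threads, 128 bytes
print(broadcast_transactions_16b(32))  # 4: 8 threads are served per transaction
```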

#### L1←L2 load hit rate

Percentage of global memory loads that missed in L1 but hit in L2. Higher is better.

 $\frac{l2\_subp0\_read\_hit\_sectors+l2\_subp1\_read\_hit\_sectors}{l2\_subp0\_read\_sector\_queries+l2\_subp1\_read\_sector\_queries} \times 100$ 

#### L1←L2 load size

Amount of data transferred from L2 cache by global memory loads. Lower is better.

 $(l2\_subp0\_read\_sector\_queries + l2\_subp1\_read\_sector\_queries) \times 32 \text{ bytes}$ 

#### Tex cache hit rate

Percentage of texture memory loads that hit in the texture memory cache. Higher is better.

$$\frac{tex0\_cache\_sector\_queries - tex0\_cache\_sector\_misses}{tex0\_cache\_sector\_queries} \times 100$$

#### Tex load size

Amount of data transferred by texture memory loads from the texture memory cache. Lower is better.

 $tex0\_cache\_sector\_queries \times 32 \text{ bytes}$ 

#### Tex←L2 load hit rate

Percentage of texture memory loads that missed in the L1 cache but hit in the L2 cache. Higher is better.

 $\frac{l2\_subp0\_read\_tex\_hit\_sectors + l2\_subp1\_read\_tex\_hit\_sectors}{l2\_subp0\_read\_tex\_sector\_queries + l2\_subp1\_read\_tex\_sector\_queries} \times 100$ 

#### Tex←L2 load size

Amount of data transferred from the L2 cache by texture memory loads. Lower is better.

 $(l2\_subp0\_read\_tex\_sector\_queries + l2\_subp1\_read\_tex\_sector\_queries) \times 32 \text{ bytes}$ 

#### Shared memory load size

Amount of data transferred by shared memory loads. In general this cannot be computed without explicit knowledge of the kernel's code, because the *shared\_load* event increments by 1 regardless of the size of the load instruction used. We can compute this metric here because all shared memory loads are guaranteed to be 8 bytes wide. Thus the shared memory load size is *shared\_load*  $\times$  8 bytes. Lower is better.

#### Shared memory bank conflicts per request

Shared memory is divided into 32 banks. If threads in a warp access the same bank but with different addresses a bank conflict happens and access is serialized. Lower is better.

 $\frac{l1\_shared\_bank\_conflict}{shared\_load}$ 

#### Device memory load size

Amount of data transferred from global/device memory. Lower is better.

 $(fb\_subp0\_read\_sectors + fb\_subp1\_read\_sectors) \times 32 \text{ bytes}$ 

#### Instruction replay overhead

Percentage of instructions that were issued due to replaying memory accesses, such as cache misses. Lower is better.

 $\frac{instructions\_issued - instructions\_executed}{instructions\_issued} \times 100$ 

#### IPC

Instructions executed per cycle. The Fermi GPU can issue up to 2 instructions per cycle, which means the range for this metric is [0,2]. Higher is better.

 $\frac{instructions\_executed}{num\_multiprocessors \times elapsed\_clocks}$ 

#### SIMD efficiency

Also called *warp execution efficiency*. Percentage of average active threads per warp to total number of threads in a warp. Higher is better.

 $\frac{thread\_inst\_executed\_0 + thread\_inst\_executed\_1}{instructions\_executed \times warp\_size} \times 100$ 

#### Branch efficiency

Measures SIMD divergence. Percentage of coherent branches to all branches. Higher is better.

 $\frac{branch-divergent\_branch}{branch} \times 100$ 

#### Achieved occupancy

Percentage of average number of active warps to maximum number of warps supported on a multiprocessor. Higher is better.

$$\frac{active\_warps}{48 \times active\_cycles} \times 100$$
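The instruction-level metrics above can be evaluated together from one set of event counts. The following is a minimal sketch: the function name, the dictionary layout and all sample values are hypothetical, and it assumes that the counters have already been aggregated the way the formulas expect. SIMD efficiency is scaled by 100 so that it is a percentage, as its description states.

```python
WARP_SIZE = 32
MAX_WARPS_PER_SM = 48  # constant from the achieved occupancy formula


def execution_metrics(c, num_multiprocessors):
    """Derive the instruction-level metrics from a dict of event counts."""
    issued = c["instructions_issued"]
    executed = c["instructions_executed"]
    thread_inst = c["thread_inst_executed_0"] + c["thread_inst_executed_1"]
    return {
        "replay_overhead_pct": 100.0 * (issued - executed) / issued,
        "ipc": executed / (num_multiprocessors * c["elapsed_clocks"]),
        "simd_efficiency_pct": 100.0 * thread_inst / (executed * WARP_SIZE),
        "branch_efficiency_pct":
            100.0 * (c["branch"] - c["divergent_branch"]) / c["branch"],
        "achieved_occupancy_pct":
            100.0 * c["active_warps"] / (MAX_WARPS_PER_SM * c["active_cycles"]),
    }


# Made-up counter values, for illustration only.
counters = {
    "instructions_issued": 1200,
    "instructions_executed": 1000,
    "thread_inst_executed_0": 12000,
    "thread_inst_executed_1": 12000,
    "elapsed_clocks": 1000,
    "branch": 100,
    "divergent_branch": 10,
    "active_warps": 24000,
    "active_cycles": 1000,
}
metrics = execution_metrics(counters, num_multiprocessors=2)
for name, value in metrics.items():
    print(f"{name}: {value:.2f}")
```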



Table 1: Scenes used for benchmarking.

# 3 Scenes

To test the performance of the different BVH and node layouts, four scenes of varying complexity and with different materials were used (Table 1). The Cornell Box scene contains two spheres, one with a glass material and one with a translucent material. The crytek-sponza scene is the improved version by Frank Meinl at Crytek [Sm]. Both the crytek-sponza and san-miguel scenes contain only diffuse materials. The kitchen scene contains a number of objects with glass material.

# 4 Evaluation - Fermi architecture

All experiments were run on a computer equipped with an Intel Core i7-960 CPU clocked at 3.2 GHz and an NVIDIA Tesla C2070 GPU, running Ubuntu 12.04.1 LTS (Linux kernel version 3.2.0-36-generic) with GCC 4.6.3, NVIDIA display driver 304.64 and CUDA toolkit 5.0. The Tesla C2070 consists of 15 multiprocessors which in turn consist of 32 processors each. Memory-wise it has 6144 MB of global/texture memory, 16 or 48 KB of shared memory or global memory L1 cache (depending on its runtime configuration) and 32768 registers per multiprocessor. The size of the L2 cache is 768 KB.

### 4.1 Memory access properties

We used micro-benchmarking [WPSAM10] to derive memory access properties. The fetch latency for a global memory load of 4 bytes that hits in L1 is  $\approx$  32 cycles, a hit in L2 costs  $\approx$  395 cycles, while missing both costs  $\approx$  523 cycles. The L1 texture memory cache size is 12 KB with a cache line size of 128 B. For texture memory, the L1 hit latency for reading 4 bytes is  $\approx$  220 cycles, the L2 hit latency is  $\approx$  524 cycles and missing both L1 and L2 incurs a latency of  $\approx$  647 cycles.

Figure 1 shows the latency when an increasing number of threads in a warp access cached memory locations with a stride of threadID × 128 B and a stride of threadID × 132 B. The first access pattern results in n-way bank conflicts for shared memory and is a worst case for global memory, as each thread reads from a different memory segment, leading to complete serialization of the request. The latency of both is the same, as L1 cache and shared memory are the same hardware. Texture memory latency stays constant and becomes lower than L1 latency as soon as at least 8 different L1 cache lines are accessed. The second access pattern results in no bank conflicts for shared memory but is again a worst case for global memory, as each thread reads from a different memory segment. Texture memory latency behaves the same as before. Thus texture memory performs equally well for access patterns that are worst cases for either global or shared memory. Figure 2 shows the broadcasting capabilities of global and texture memory when several threads in a warp read the same 4-byte word. Global memory needs fewer transactions than texture memory.
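Why the two strides behave so differently for shared memory, yet identically for global memory, can be checked with a short calculation. The sketch below assumes 4-byte wide banks with consecutive words mapped to consecutive banks and 128-byte memory segments; the function name is hypothetical.

```python
WARP_SIZE = 32
NUM_BANKS = 32       # shared memory banks (see Section 2)
BANK_WIDTH = 4       # bytes per bank word (assumption)
SEGMENT_BYTES = 128  # size of one global memory segment / L1 cache line


def analyse_stride(stride_bytes):
    """Return (worst bank conflict degree, number of distinct segments)."""
    addresses = [tid * stride_bytes for tid in range(WARP_SIZE)]
    banks = [(addr // BANK_WIDTH) % NUM_BANKS for addr in addresses]
    segments = {addr // SEGMENT_BYTES for addr in addresses}
    # Largest number of threads that map to the same bank.
    conflict_degree = max(banks.count(bank) for bank in set(banks))
    return conflict_degree, len(segments)


print(analyse_stride(128))  # (32, 32): every thread hits bank 0
print(analyse_stride(132))  # (1, 32): every thread hits its own bank
```

In both cases the 32 threads touch 32 different segments, which is why both patterns are worst cases for global memory.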

### 4.2 Baseline

The baseline BVH is laid out in DFS order and stores nodes in AoS format. The AoS node format was chosen because Aila et al. [AL09] use it in their GPU ray traversal routines, which are among the fastest. Tree nodes are accessed via global memory and geometry via texture memory. The trace kernel was profiled using a path tracer rendering 1024×768 pixels at 32 spp with the DFS BVH layout and the AoS node layout. Figures 3, 4, 5 and 6 show the runtime behavior (in ms) and GPU metrics (percentage) over all render loop iterations for our test scenes with BVH nodes stored either in global memory (left) or texture memory (right).

### 4.3 BVH and node layouts

Tables 2 and 3 rank all BVH and node layout combinations accessed via global memory or texture memory. The ranking is based on the average speedup achieved over the DFS layout in the respective memory area. The SWST, TDFS and TBFS layouts require a threshold probability. We tested a number of different values to find the best performing one. The best threshold is required to perform well for all scenes in our data set so that its performance extends to unknown data sets. We use the sum of the scene runtimes to measure the performance of a threshold and choose the best performing one for each layout. The determined thresholds are stated next to the respective BVH layout names in the tables. In the following, we compare the best performing combinations of threshold, BVH and node layout in each memory area to the other introduced BVH layouts.
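The threshold selection described above amounts to minimizing the runtime summed over all scenes. A minimal sketch, in which the function name, the scene names and all runtimes are made up for illustration:

```python
def best_threshold(runtimes_by_threshold):
    """Pick the threshold whose runtimes, summed over all scenes, are lowest."""
    return min(runtimes_by_threshold,
               key=lambda t: sum(runtimes_by_threshold[t].values()))


# Hypothetical runtimes in ms per threshold and scene.
runtimes = {
    0.3: {"scene_a": 900.0, "scene_b": 2000.0},
    0.4: {"scene_a": 880.0, "scene_b": 1950.0},
    0.5: {"scene_a": 870.0, "scene_b": 1990.0},
}
print(best_threshold(runtimes))  # 0.4
```

Note that 0.5 is fastest for `scene_a` alone, but 0.4 wins on the summed runtime. Summing also means that scenes with long runtimes dominate the choice.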



Figure 1: **Tesla C2070** - Latency plots of two access patterns which are either the worst case (top) or optimal (bottom) for shared memory compared with the latency of directly hitting in global or texture memory with the same pattern. Both patterns are worst case access patterns for global memory as access has to be serialized, though they hit in cache. In both cases texture memory latency stays constant and performs better than global memory, as soon as at least 8 different memory segments are accessed.

### 4.4 Best performing layout

The best performing BVH layout for nodes stored in global memory is the TDFS layout with a threshold of 0.4 in combination with the SoA32\_24 node layout, shown on the left side of Figures 7, 8, 9 and 10. For nodes stored in texture memory, a TDFS layout with a threshold of 0.5 and the AoS node layout is most beneficial. In global memory we achieved a runtime reduction of 2.8 - 6.7%. In texture memory, we gained a runtime reduction of  $\approx 5.0 - 17.6\%$  compared to the baseline in global memory. Thus, contrary to [ALK12], our path tracer benefited from using texture memory for loading nodes when run on a Fermi GPU. Compared to the baseline also accessed via texture memory, an improvement of only  $\approx 2.3\%$  was observable in the san-miguel scene for treelet-based layouts. We attribute the smaller amount of data transferred when using global memory to superior broadcast capabilities (see Section 4.1).
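The runtime reductions quoted above can be recomputed from the runtime (R) rows of Tables 2 and 3. The sketch below assumes that the baseline corresponds to the DFS / AoS column of Table 2; the variable and function names are hypothetical.

```python
SCENES = ["crytek-sponza", "kitchen", "hairball-glass", "san-miguel"]

# Runtimes in ms, copied from the R rows of Tables 2 and 3.
baseline_gmem = [903.1, 2004.4, 9888.7, 3021.7]  # DFS, AoS, global memory
best_gmem = [877.7, 1868.5, 9492.3, 2872.7]      # TDFS 0.4, SoA32_24, global
best_tmem = [832.1, 1726.9, 9391.0, 2489.6]      # TDFS 0.5, AoS, texture


def runtime_reduction(baseline, candidate):
    """Per-scene runtime reduction in percent relative to the baseline."""
    return [100.0 * (b - c) / b for b, c in zip(baseline, candidate)]


gmem_reduction = runtime_reduction(baseline_gmem, best_gmem)
tmem_reduction = runtime_reduction(baseline_gmem, best_tmem)
for scene, g, t in zip(SCENES, gmem_reduction, tmem_reduction):
    print(f"{scene}: global {g:.1f}%, texture {t:.1f}%")
```

Under this assumption the per-scene values span roughly the ranges stated in the text for both memory areas.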

We have also tried to leverage the unused shared memory by using it as a static cache for a part of the BVH but were unable to achieve any advantages over using only a single memory area.

Figure 2: **Tesla C2070** - Number of transactions for a broadcast in global and texture memory for an increasing number of threads in a warp.

Figure 3: **Tesla C2070 - Crytek Sponza -** Baseline trace kernel profiling graph. Nodes are either stored in global (left) or texture memory (right).



Figure 4: **Tesla C2070** - **Kitchen** - Baseline trace kernel profiling graph. Nodes are either stored in global (left) or texture memory (right).



Figure 5: **Tesla C2070** - **Hairball - Glass** - Baseline trace kernel profiling graph. Nodes are either stored in global (left) or texture memory (right).



Figure 6: **Tesla C2070 - San Miguel -** Baseline trace kernel profiling graph. Nodes are either stored in global (left) or texture memory (right).



Figure 7: **Tesla C2070** - **Crytek Sponza** - Trace kernel profiling graph for the best layouts for BVH nodes stored in global memory (left, TDFS 0.4, SoA32\_24) and stored in texture memory (right, TDFS 0.5, AoS).

| Scene          | BVH layout  | TDFS 0.4 | SWST 0.5 | COL      | TBFS 0.5 | vEB      | DFS      | BFS      | DFS    | COL     | vEB     | BFS     | vEB     | COL     | BFS     | DFS     |
|----------------|-------------|----------|----------|----------|----------|----------|----------|----------|--------|---------|---------|---------|---------|---------|---------|---------|
|                | Node layout | SoA32_24 | SoA32_24 | SoA32_24 | SoA32_24 | SoA32_24 | SoA32_24 | SoA32_24 | AoS    | AoS     | AoS     | AoS     | SoA16_8 | SoA16_8 | SoA16_8 | SoA16_8 |
| crytek-sponza  | R           | 877.7    | 882.5    | 885.9    | 886.5    | 884.9    | 897.6    | 902.2    | 903.1  | 905.6   | 900.0   | 917.8   | 967.1   | 969.3   | 992.0   | 1022.5  |
| crytek-sponza  | H           | 85.0     | 84.6     | 84.3     | 84.2     | 84.3     | 83.3     | 82.7     | 92.1   | 92.0    | 91.8    | 91.4    | 72.8    | 72.9    | 70.7    | 68.2    |
| crytek-sponza  | SZ          | 310.4    | 311.3    | 311.1    | 312.2    | 311.4    | 313.2    | 311.9    | 315.8  | 315.7   | 316.0   | 316.3   | 304.4   | 304.8   | 305.9   | 310.1   |
| kitchen        | R           | 1868.5   | 1900.2   | 1886.5   | 1873.4   | 1908.4   | 1951.4   | 1944.9   | 2004.4 | 1999.8  | 2009.0  | 2026.1  | 2106.7  | 2094.6  | 2137.9  | 2267.3  |
| kitchen        | H           | 84.0     | 82.6     | 82.8     | 83.0     | 81.5     | 80.4     | 78.5     | 90.6   | 90.7    | 90.4    | 89.7    | 67.3    | 69.0    | 63.2    | 61.8    |
| kitchen        | SZ          | 698.6    | 700.0    | 693.3    | 688.7    | 693.0    | 704.4    | 678.8    | 708.2  | 704.7   | 704.4   | 698.2   | 674.3   | 676.1   | 653.7   | 696.0   |
| hairball-glass | R           | 9492.3   | 9489.5   | 9597.6   | 9786.7   | 9608.1   | 9606.3   | 10077.6  | 9888.7 | 10037.6 | 10068.9 | 10278.6 | 11270.4 | 11360.6 | 12235.8 | 11743.0 |
| hairball-glass | H           | 68.7     | 68.8     | 67.2     | 65.6     | 66.9     | 67.3     | 62.0     | 84.7   | 83.8    | 83.4    | 82.2    | 45.1    | 44.7    | 36.2    | 43.2    |
| hairball-glass | SZ          | 2534.7   | 2534.1   | 2532.1   | 2536.9   | 2529.9   | 2535.1   | 2525.8   | 2554.5 | 2554.7  | 2556.1  | 2554.5  | 2498.0  | 2511.0  | 2486.0  | 2524.3  |
| san-miguel     | R           | 2872.7   | 2898.1   | 2912.0   | 2910.4   | 2920.0   | 2980.4   | 3039.1   | 3021.7 | 3043.8  | 3053.4  | 3099.9  | 3546.9  | 3567.8  | 3759.5  | 3883.9  |
| san-miguel     | H           | 68.9     | 68.8     | 68.0     | 67.5     | 67.2     | 66.3     | 63.3     | 85.0   | 84.4    | 83.9    | 82.9    | 44.4    | 44.2    | 39.5    | 40.0    |
| san-miguel     | SZ          | 886.0    | 886.3    | 884.6    | 883.9    | 882.2    | 889.1    | 880.2    | 893.6  | 892.4   | 892.3   | 890.4   | 870.0   | 875.0   | 867.0   | 886.1   |

Table 2: Ranking Tesla C2070 - BVH (global memory) and node layout combinations sorted ascending by average speedup over all scenes accessed via global memory. Runtime in ms, Hitrate in percent and SZ denotes the total amount of data transferred (in GB).

| Scene          | BVH layout  | TDFS 0.5 | TBFS 0.1 | BFS    | vEB    | COL    | DFS    | BFS      | vEB      | COL      | DFS      | vEB     | COL     | BFS     | DFS     |
|----------------|-------------|----------|----------|--------|--------|--------|--------|----------|----------|----------|----------|---------|---------|---------|---------|
|                | Node layout | AoS      | AoS      | AoS    | AoS    | AoS    | AoS    | SoA32_24 | SoA32_24 | SoA32_24 | SoA32_24 | SoA16_8 | SoA16_8 | SoA16_8 | SoA16_8 |
| crytek-sponza  | R           | 832.1    | 832.5    | 832.6  | 832.4  | 833.0  | 833.6  | 848.1    | 847.7    | 848.8    | 852.6    | 884.9   | 886.5   | 889.2   | 904.3   |
| crytek-sponza  | H           | 68.0     | 67.8     | 67.6   | 67.8   | 67.8   | 67.7   | 62.5     | 62.6     | 62.6     | 62.3     | 48.7    | 48.9    | 48.0    | 47.6    |
| crytek-sponza  | SZ          | 270.3    | 270.2    | 270.2  | 270.3  | 270.2  | 270.2  | 270.4    | 270.3    | 270.4    | 270.3    | 269.0   | 269.0   | 269.1   | 269.2   |
| kitchen        | R           | 1726.9   | 1726.9   | 1728.2 | 1727.8 | 1729.0 | 1730.8 | 1754.3   | 1756.1   | 1757.2   | 1771.3   | 1846.6  | 1844.3  | 1849.8  | 1904.7  |
| kitchen        | H           | 52.1     | 52.1     | 52.1   | 51.7   | 51.9   | 51.5   | 44.9     | 44.3     | 44.4     | 43.6     | 20.9    | 21.5    | 21.3    | 19.9    |
| kitchen        | SZ          | 559.3    | 559.2    | 559.3  | 559.3  | 559.3  | 559.3  | 559.5    | 559.5    | 559.6    | 559.6    | 556.6   | 556.8   | 553.1   | 559.0   |
| hairball-glass | R           | 9391.0   | 9392.4   | 9386.1 | 9377.7 | 9391.2 | 9393.3 | 9630.3   | 9621.6   | 9642.7   | 9658.5   | 10423.1 | 10479.7 | 10603.3 | 10540.0 |
| hairball-glass | H           | 40.1     | 40.4     | 40.4   | 40.1   | 39.9   | 40.1   | 33.4     | 33.0     | 33.2     | 32.9     | 13.3    | 14.0    | 13.9    | 12.1    |
| hairball-glass | SZ          | 2200.9   | 2200.6   | 2200.4 | 2200.6 | 2200.7 | 2200.3 | 2201.0   | 2201.2   | 2201.4   | 2201.6   | 2192.7  | 2192.3  | 2191.7  | 2194.3  |
| san-miguel     | R           | 2489.6   | 2494.2   | 2494.2 | 2499.4 | 2514.7 | 2549.0 | 2667.9   | 2680.6   | 2712.9   | 2768.0   | 3162.2  | 3190.4  | 3186.7  | 3358.6  |
| san-miguel     | H           | 37.7     | 37.8     | 37.8   | 37.7   | 37.6   | 37.5   | 32.1     | 31.9     | 31.8     | 31.6     | 7.7     | 7.7     | 7.3     | 7.1     |
| san-miguel     | SZ          | 663.3    | 663.2    | 663.2  | 663.3  | 663.4  | 664.0  | 663.6    | 663.7    | 664.0    | 664.7    | 660.5   | 661.0   | 659.7   | 662.6   |

Table 3: Ranking Tesla C2070 - BVH (texture memory) and node layout combinations sorted ascending by average speedup over all scenes accessed via global memory. Runtime in ms, Hitrate in percent and SZ denotes the total amount of data transferred (in GB).



Figure 8: **Tesla C2070** - **Kitchen** - Trace kernel profiling graph for the best layouts for BVH nodes stored in global memory (left, TDFS 0.4, SoA32\_24) and stored in texture memory (right, TDFS 0.5, AoS).



Figure 9: **Tesla C2070** - **Hairball** - **Glass** - Trace kernel profiling graph for the best layouts for BVH nodes stored in global memory (left, TDFS 0.4, SoA32\_24) and stored in texture memory (right, TDFS 0.5, AoS).



Figure 10: **Tesla C2070 - San Miguel** - Trace kernel profiling graph for the best layouts for BVH nodes stored in global memory (left, TDFS 0.4, SoA32\_24) and stored in texture memory (right, TDFS 0.5, AoS).

# 5 Evaluation - Kepler architecture

All experiments were performed using the system described in Section 4 but equipped with a Geforce GTX 680 GPU instead of a Tesla C2070. The Geforce GTX 680 consists of eight Streaming Multiprocessors (SMX) with 192 CUDA cores each. It provides 2048 MB of global/texture memory, 16, 32 or 48 KB of shared memory or L1 cache for local memory (depending on the runtime configuration) and 65536 registers per SMX. Accesses to global memory bypass the L1 cache and directly go through the 512KB L2 cache.

### 5.1 Memory access properties

Again we used micro-benchmarking [WPSAM10] to derive memory access properties. For global memory, the L2 hit latency is  $\approx 160$  cycles while a miss costs 290 cycles. The L1 texture memory cache size is 12 KB with a cache line size of 128 B (see Figure 11). For texture memory, the L1 hit latency for reading 4 bytes is  $\approx 105$  cycles, the L2 hit latency is  $\approx 266$  cycles and missing both L1 and L2 incurs a latency of  $\approx 350$  cycles.

Figure 12 shows the latency when an increasing number of threads in a warp access cached memory locations with a stride of threadID × 128 B and a stride of threadID × 132 B. The first access pattern results in n-way bank conflicts for shared memory and is a worst case for global memory, as each thread reads from a different memory segment, leading to complete serialization of the request. Texture memory latency stays constant and is lower than global memory access latency. The second access pattern results in no bank conflicts for shared memory but is again a worst case for global memory, as each thread reads from a different memory segment. Texture memory latency behaves the same as before. Thus texture memory performs equally well for access patterns that are worst cases for either global or shared memory. Figure 13 shows the broadcasting capabilities of global and texture memory when several threads in a warp read the same 4-byte word. Global memory needs fewer transactions than texture memory.



Figure 11: **Geforce GTX 680** - L1 texture cache latency plot indicating that cache size is 12KB with a cache line size of 128B. There are 4 cache sets with 24-way set associativity.
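The cache geometry stated in the caption of Figure 11 is consistent with the 12 KB total, as the following short check shows. The set-index function is only an illustrative assumption (a plain modulo mapping); the actual mapping used by the hardware is not derived in this report.

```python
LINE_BYTES = 128  # cache line size
NUM_SETS = 4      # number of cache sets
WAYS = 24         # set associativity

cache_bytes = LINE_BYTES * NUM_SETS * WAYS
print(cache_bytes, cache_bytes // 1024)  # 12288 12


def cache_set(address):
    """Hypothetical set index, assuming a simple modulo mapping of lines."""
    return (address // LINE_BYTES) % NUM_SETS


# Under this assumption, addresses 512 bytes apart fall into the same set.
print(cache_set(0), cache_set(512), cache_set(128))  # 0 0 1
```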




Figure 12: **Geforce GTX 680** - Latency plots of two access patterns which are either the worst case (top) or optimal (bottom) for shared memory compared with the latency of directly hitting in global or texture memory with the same pattern. Both patterns are worst case access patterns for global memory as access has to be serialized, though they hit in cache. Texture memory performs equally well in both cases.

### 5.2 Baseline

We chose the same baseline setup as in Section 4.2. Figures 14, 15, 16 and 17 show the runtime behavior (in ms) and GPU metrics (percentage) over all iterations for our test scenes with BVH nodes stored in global memory (left) and stored in texture memory (right).

### 5.3 BVH and node layouts

Tables 4 and 5 show a ranking of all BVH and node layout combinations which were accessed via global memory or texture memory. Thresholds for the SWST, TDFS and TBFS layouts were determined in the same manner as described in Section 4.3.

### 5.4 Profiling stats

Tables 6, 7, 8 and 9 illustrate the changes of the GPU metrics from the baseline measurements with DFS layout to the measurements with the best performing layout combination. Cells with four values separated by slashes represent minimum, average, maximum and average absolute deviation of the respective metric over profiled iterations.

| Scene          | BVH layout  | TDFS 0.6 | BFS    | TBFS 0.3 | vEB    | COL    | DFS    | SWST 0.5 | BFS      | vEB      | COL      | DFS      | BFS     | vEB     | COL     | DFS     |
|----------------|-------------|----------|--------|----------|--------|--------|--------|----------|----------|----------|----------|----------|---------|---------|---------|---------|
|                | Node layout | AoS      | AoS    | AoS      | AoS    | AoS    | AoS    | AoS      | SoA32_24 | SoA32_24 | SoA32_24 | SoA32_24 | SoA16_8 | SoA16_8 | SoA16_8 | SoA16_8 |
| crytek-sponza  | R           | 581.7    | 583.3  | 582.6    | 582.5  | 582.5  | 583.5  | 582.2    | 581.0    | 583.9    | 585.1    | 595.3    | 637.4   | 641.3   | 651.3   | 680.1   |
| crytek-sponza  | H           | 86.7     | 86.5   | 86.6     | 86.6   | 86.7   | 86.6   | 86.8     | 85.1     | 85.3     | 85.4     | 84.6     | 77.2    | 77.6    | 77.8    | 75.8    |
| crytek-sponza  | SZ          | 94.3     | 94.3   | 94.3     | 94.3   | 94.3   | 94.3   | 94.3     | 94.1     | 94.2     | 94.2     | 94.2     | 93.4    | 93.3    | 93.2    | 93.6    |
| kitchen        | R           | 1386.2   | 1364.6 | 1371.8   | 1374.6 | 1385.5 | 1394.9 | 1391.9   | 1310.1   | 1355.1   | 1335.2   | 1357.9   | 1346.1  | 1364.4  | 1371.4  | 1436.5  |
| kitchen        | H           | 89.1     | 89.0   | 89.1     | 89.0   | 89.1   | 89.0   | 89.1     | 88.4     | 88.7     | 88.8     | 88.3     | 81.7    | 83.4    | 84.0    | 82.0    |
| kitchen        | SZ          | 218.6    | 218.6  | 218.6    | 218.6  | 218.6  | 218.6  | 218.6    | 217.8    | 217.9    | 217.9    | 217.9    | 213.4   | 215.3   | 215.3   | 217.1   |
| hairball-glass | R           | 7504.9   | 7462.2 | 7469.5   | 7469.4 | 7539.5 | 7576.0 | 7638.8   | 9099.2   | 9059.4   | 9269.1   | 9320.0   | 10747.6 | 10555.1 | 10780.1 | 10969.3 |
| hairball-glass | H           | 60.5     | 60.4   | 60.4     | 60.6   | 60.6   | 60.7   | 60.8     | 55.3     | 55.7     | 55.4     | 54.8     | 38.1    | 40.0    | 39.6    | 38.3    |
| hairball-glass | SZ          | 948.8    | 948.8  | 948.8    | 948.6  | 949.1  | 949.1  | 949.3    | 950.3    | 950.1    | 950.7    | 950.7    | 942.2   | 942.8   | 943.4   | 943.0   |
| san-miguel     | R           | 2071.4   | 2131.1 | 2142.8   | 2165.4 | 2166.9 | 2205.3 | 2267.9   | 2683.3   | 2824.0   | 2859.4   | 2932.9   | 3270.6  | 3376.6  | 3437.5  | 3667.2  |
| san-miguel     | H           | 71.4     | 70.7   | 70.8     | 70.7   | 70.9   | 70.6   | 71.0     | 64.4     | 64.6     | 64.9     | 63.3     | 48.3    | 48.6    | 49.1    | 45.4    |
| san-miguel     | SZ          | 287.4    | 287.6  | 287.7    | 287.7  | 287.7  | 287.9  | 288.0    | 287.4    | 287.6    | 287.7    | 288.2    | 285.3   | 286.1   | 286.3   | 287.1   |

Table 4: Ranking Geforce GTX 680 - BVH (global memory) and node layout combinations sorted ascending by average speedup over all scenes accessed via global memory. Runtime in ms, Hitrate in percent and SZ denotes the total amount of data transferred (in GB).

| Scene          | BVH layout  | TDFS 0.6 | BFS    | TBFS 0.2 | vEB    | COL    | SWST 0.4 | DFS    | BFS      | vEB      | COL      | DFS      | BFS     | vEB     | COL     | DFS     |
|----------------|-------------|----------|--------|----------|--------|--------|----------|--------|----------|----------|----------|----------|---------|---------|---------|---------|
|                | Node layout | AoS      | AoS    | AoS      | AoS    | AoS    | AoS      | AoS    | SoA32_24 | SoA32_24 | SoA32_24 | SoA32_24 | SoA16_8 | SoA16_8 | SoA16_8 | SoA16_8 |
| crytek-sponza  | R           | 372.4    | 373.1  | 373.1    | 373.0  | 373.4  | 373.8    | 374.2  | 412.9    | 417.0    | 417.1    | 423.4    | 497.0   | 495.3   | 506.1   | 535.7   |
| crytek-sponza  | H           | 76.5     | 76.4   | 76.5     | 76.5   | 76.5   | 76.5     | 76.4   | 73.0     | 73.0     | 73.0     | 72.7     | 60.2    | 61.1    | 61.2    | 60.5    |
| crytek-sponza  | SZ          | 136.3    | 136.3  | 136.3    | 136.3  | 136.3  | 136.3    | 136.3  | 136.4    | 136.4    | 136.4    | 136.4    | 135.7   | 135.6   | 135.6   | 135.7   |
| kitchen        | R           | 812.6    | 807.7  | 810.2    | 806.7  | 804.2  | 805.1    | 806.2  | 845.6    | 837.4    | 852.9    | 852.7    | 988.0   | 981.6   | 973.1   | 1042.8  |
| kitchen        | H           | 65.6     | 65.6   | 65.7     | 65.5   | 65.6   | 65.5     | 65.4   | 61.5     | 61.1     | 61.2     | 60.7     | 43.9    | 43.9    | 44.2    | 42.9    |
| kitchen        | SZ          | 268.1    | 268.1  | 268.1    | 268.1  | 268.1  | 268.1    | 268.1  | 268.3    | 268.3    | 268.3    | 268.3    | 264.8   | 266.5   | 266.6   | 267.6   |
| hairball-glass | R           | 5369.4   | 5356.7 | 5359.3   | 5357.3 | 5386.1 | 5394.3   | 5394.9 | 6868.8   | 6839.6   | 6956.5   | 6971.4   | 9570.8  | 9261.5  | 9515.5  | 9663.3  |
| hairball-glass | H           | 59.8     | 59.9   | 59.9     | 59.8   | 59.8   | 59.9     | 59.8   | 56.6     | 56.4     | 56.3     | 56.3     | 36.7    | 37.8    | 37.7    | 38.4    |
| hairball-glass | SZ          | 1053.8   | 1053.8 | 1053.8   | 1053.8 | 1053.8 | 1053.8   | 1053.8 | 1056.5   | 1056.6   | 1056.5   | 1056.5   | 1048.7  | 1048.9  | 1048.6  | 1049.5  |
| san-miguel     | R           | 1300.4   | 1315.0 | 1315.0   | 1326.4 | 1334.8 | 1353.1   | 1356.4 | 1877.0   | 1955.8   | 1978.8   | 2023.4   | 2837.8  | 2932.9  | 2999.3  | 3229.9  |
| san-miguel     | H           | 61.0     | 61.2   | 61.2     | 61.1   | 61.0   | 60.9     | 60.9   | 56.4     | 56.2     | 56.2     | 55.9     | 34.7    | 35.2    | 35.3    | 35.1    |
| san-miguel     | SZ          | 337.3    | 337.3  | 337.3    | 337.3  | 337.3  | 337.2    | 337.3  | 337.3    | 337.4    | 337.4    | 337.3    | 335.0   | 335.3   | 335.4   | 335.9   |

Table 5: Ranking Geforce GTX 680 - BVH (texture memory) and node layout combinations sorted ascending by average speedup over all scenes accessed via global memory. Runtime in ms, Hitrate in percent and SZ denotes the total amount of data transferred (in GB).

Figure 13: **Geforce GTX 680** - Number of transactions for a broadcast in global and texture memory for an increasing number of threads in a warp.

Figure 14: **Geforce GTX 680 - Crytek Sponza -** Baseline trace kernel profiling graph. Nodes are either stored in global (left) or texture memory (right).

### 5.5 Best performing layout

For nodes stored in either global or texture memory, the best performing BVH layout is the TDFS layout with a threshold of 0.6 in combination with the AoS node layout. In agreement with [ALK12], storing nodes in texture memory is most beneficial for Kepler GPUs. Comparing the runtime of the best layout combinations in texture and global memory, we get the same qualitative behavior as for the Fermi GPU in Section 4.4. In global memory we achieved a runtime reduction of 1% - 6%. In texture memory, we gained a runtime reduction of  $\approx 30.0\% - 40\%$  compared to the baseline in global memory. Compared to the baseline also accessed via texture memory, an improvement of only 0.5% - 4.0% was observable for TDFS. We attribute the smaller amount of data transferred when using global memory to superior broadcast capabilities (see Section 5.1).

# 5.6 Tesla K20c Addendum

Figure 15: Geforce GTX 680 - Kitchen - Baseline trace kernel profiling graph. Nodes are either stored in global (left) or texture memory (right).

Figure 16: **Geforce GTX 680 - Hairball - Glass -** Baseline trace kernel profiling graph. Nodes are either stored in global (left) or texture memory (right).

According to [NVIa], NVIDIA Kepler GPUs with compute capability 3.5 feature a 48 KB read-only data cache per SMX, which is the same as the texture cache. To see the effects of a much larger texture cache, we also performed our experiments on a Tesla K20c GPU. Figures 22, 23, 24 and 25 show results for our baseline layouts accessing geometry and nodes via the read-only data cache, compared with the baseline results for the GTX 680. ECC was turned on for these experiments. On average, runtime is only about 8% better despite the 4 times larger cache and the 62.5% higher number of cores. In fact, the cache hit rate even drops slightly, by 3 percentage points. Using the microbenchmarking code from [WPSAM10], we deduced that the size of the read-only data cache is seemingly only 12 KB and not 48 KB (assuming there is no mistake on our side), which might explain why there is no improvement in cache hit rate. Turning ECC off resulted in less than one percent runtime improvement. Thus, bandwidth from device memory to L2 cache does not seem to be the main bottleneck.
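The idea behind deducing a cache size with a microbenchmark can be illustrated without a GPU: sweep repeatedly over working sets of growing size and look for the footprint at which hits collapse. The sketch below is not the [WPSAM10] code. It is a CPU-side simulation of a hypothetical fully associative LRU cache with an assumed 32-byte line, and all names are illustrative. Real texture caches are set-associative and are probed via access latency, so the transition observed on hardware is less abrupt.

```python
from collections import OrderedDict

def sweep_hit_rate(footprint_bytes, cache_bytes, line_bytes=32, passes=3):
    """Hit rate of cyclic sweeps over a working set, measured after one warm-up sweep.

    Models a hypothetical fully associative LRU cache (an assumption, not real hardware).
    """
    capacity = cache_bytes // line_bytes    # number of lines the cache can hold
    lines = footprint_bytes // line_bytes   # distinct lines touched per sweep
    cache = OrderedDict()                   # keys ordered from least to most recently used
    hits = accesses = 0
    for sweep in range(passes):
        for line in range(lines):
            hit = line in cache
            if hit:
                cache.move_to_end(line)         # mark as most recently used
            else:
                cache[line] = True
                if len(cache) > capacity:
                    cache.popitem(last=False)   # evict the least recently used line
            if sweep > 0:                       # ignore the cold first sweep
                accesses += 1
                hits += hit
    return hits / accesses

def estimate_cache_bytes(cache_bytes, step_bytes=1024, max_bytes=64 * 1024):
    """Largest tested footprint for which every access after warm-up is a hit."""
    best = 0
    for footprint in range(step_bytes, max_bytes + 1, step_bytes):
        if sweep_hit_rate(footprint, cache_bytes) == 1.0:
            best = footprint
    return best

for simulated_kb in (12, 48):
    estimate = estimate_cache_bytes(simulated_kb * 1024)
    print(f"simulated {simulated_kb} KB cache -> estimated {estimate // 1024} KB")
```

In this idealized model, a working set that fits is served entirely from the cache, while a working set even one step larger thrashes under LRU replacement, so the last fully hitting footprint identifies the capacity.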



Figure 17: **Geforce GTX 680** - **San Miguel** - Baseline trace kernel profiling graph. Nodes are either stored in global (left) or texture memory (right).



Figure 18: **Geforce GTX 680** - **Crytek Sponza** - Trace kernel profiling graph for the best layout in global memory (left, TDFS 0.6, AoS) and texture memory (right, TDFS 0.6, AoS).

| Sponza, GMem, TDFS 0.6, AoS     |                                                         |
|---------------------------------|---------------------------------------------------------|
| Runtime (ms)                    | 2.5 (-0.1) / 14.4 (-0.1) / 17.4 (-0.1) / 2.2 (+0.0)     |
| Global load transact./req.      | 2.0 (+0.0) / 5.3 (+0.0) / 9.6 (+0.0) / 0.7 (+0.0)       |
| L2 load hit rate (%)            | 81.0 (+0.5) / 86.7 (+0.1) / 93.0 (-0.1) / 1.1 (+0.0)    |
| L2 load size (GB)               | 55.6 (-0.0)                                             |
| L2 load bandwidth (GB/s)        | 72.5 (+0.1) / 101.9 (+0.4) / 112.5 (+1.0) / 3.3 (+0.1)  |
| L2 load size (GB)               | 61.6 (-0.0)                                             |
| Dev. mem. load size (GB)        | 11.2 (-0.0)                                             |
| Inst. replay overhead (%)       | 24.8 (+0.0) / 29.7 (-0.1) / 36.2 (-0.3) / 1.1 (+0.0)    |
| **Sponza, TMem, TDFS 0.6, AoS** |                                                         |
| Runtime (ms)                    | 1.4 (-0.1) / 9.2 (-0.1) / 10.9 (+0.0) / 1.2 (+0.0)      |
| Tex cache hit rate (%)          | 49.4 (+0.1) / 76.5 (+0.1) / 99.4 (+0.0) / 4.4 (-0.1)    |
| Tex load size (GB)              | 111.2 (+0.0)                                            |
| Tex load bandwidth (GB/s)       | 231.0 (+4.8) / 299.4 (+1.5) / 369.3 (+0.7) / 9.3 (-0.4) |
| Tex←L2 load hit rate (%)        | 62.1 (+0.2) / 69.5 (+0.0) / 95.7 (+0.1) / 2.8 (-0.1)    |
| Tex←L2 load size (GB)           | 25.2 (-0.1)                                             |
| Tex←L2 load bandwidth (GB/s)    | 2.1 (-0.1) / 68.3 (+0.1) / 115.4 (+2.2) / 11.1 (+0.1)   |
| L2 load size (GB)               | 26.9 (-0.1)                                             |
| Dev. mem. load size (GB)        | 9.4 (-0.0)                                              |
| Inst. replay overhead (%)       | 0.7 (+0.0) / 1.3 (+0.0) / 2.0 (+0.0) / 0.1 (+0.0)       |

Table 6: **Geforce GTX 680 - Crytek Sponza -** Trace kernel profiling data totals for TDFS 0.6, AoS in global memory (top) and in texture memory (bottom).

| Kitchen, GMem, TDFS 0.6, AoS     |                                                          |
|----------------------------------|----------------------------------------------------------|
| Runtime (ms)                     | 0.4 (+0.0) / 12.5 (-0.1) / 16.5 (-0.1) / 4.0 (+0.0)      |
| Global load transact./req.       | 2.1 (+0.0) / 7.5 (+0.0) / 10.9 (+0.0) / 0.9 (+0.0)       |
| L2 load hit rate (%)             | 82.5 (+0.0) / 89.1 (+0.1) / 90.6 (+0.2) / 0.6 (+0.1)     |
| L2 load size (GB)                | 134.8 (-0.0)                                             |
| L2 load bandwidth (GB/s)         | 73.7 (+2.1) / 105.7 (+0.6) / 112.8 (+0.4) / 4.5 (+0.1)   |
| L2 load size (GB)                | 152.2 (-0.0)                                             |
| Dev. mem. load size (GB)         | 21.9 (-0.2)                                              |
| Inst. replay overhead (%)        | 23.4 (+0.0) / 32.5 (-0.1) / 36.2 (+0.0) / 1.2 (+0.0)     |
| **Kitchen, TMem, TDFS 0.6, AoS** |                                                          |
| Runtime (ms)                     | 0.2 (+0.0) / 7.2 (+0.1) / 9.6 (+0.1) / 2.3 (+0.0)        |
| Tex cache hit rate (%)           | 55.2 (+0.3) / 65.6 (+0.2) / 100.0 (+0.0) / 3.7 (+0.0)    |
| Tex load size (GB)               | 217.5 (-0.0)                                             |
| Tex load bandwidth (GB/s)        | 188.6 (+3.8) / 281.4 (-2.2) / 356.1 (-6.5) / 12.0 (-0.6) |
| Tex←L2 load hit rate (%)         | 71.9 (-0.6) / 78.3 (+0.1) / 80.7 (-0.1) / 1.4 (+0.0)     |
| Tex←L2 load size (GB)            | 70.8 (-0.4)                                              |
| Tex←L2 load bandwidth (GB/s)     | 0.1 (+0.0) / 94.8 (-1.2) / 122.2 (-1.3) / 8.5 (-0.2)     |
| L2 load size (GB)                | 75.3 (-0.4)                                              |
| Dev. mem. load size (GB)         | 18.9 (-0.2)                                              |
| Inst. replay overhead (%)        | 1.2 (+0.0) / 1.6 (+0.0) / 2.5 (+0.0) / 0.2 (+0.0)        |

Table 7: **Geforce GTX 680 - Kitchen -** Trace kernel profiling data totals for TDFS 0.6, AoS in global memory (top) and in texture memory (bottom).

| Hairball - Glass, GMem, TDFS 0.6, AoS     |                                                         |
|-------------------------------------------|---------------------------------------------------------|
| Runtime (ms)                              | 2.3 (+0.1) / 39.6 (-0.4) / 73.2 (-1.1) / 23.8 (-0.3)    |
| Global load transact./req.                | 2.3 (+0.0) / 5.1 (+0.0) / 6.2 (+0.0) / 0.7 (+0.0)       |
| L2 load hit rate (%)                      | 45.8 (-0.4) / 60.5 (-0.2) / 94.0 (-0.1) / 6.6 (+0.0)    |
| L2 load size (GB)                         | 481.0 (+0.2)                                            |
| L2 load bandwidth (GB/s)                  | 15.4 (+0.1) / 63.4 (+0.6) / 84.3 (-0.5) / 13.1 (+0.1)   |
| L2 load size (GB)                         | 573.5 (+0.2)                                            |
| Dev. mem. load size (GB)                  | 258.9 (+1.4)                                            |
| Inst. replay overhead (%)                 | 21.1 (+0.0) / 29.3 (-0.1) / 35.4 (-0.3) / 2.4 (-0.1)    |
| **Hairball - Glass, TMem, TDFS 0.6, AoS** |                                                         |
| Runtime (ms)                              | 1.7 (+0.0) / 28.2 (-0.1) / 52.7 (-0.2) / 16.6 (-0.1)    |
| Tex cache hit rate (%)                    | 48.3 (+0.1) / 59.8 (+0.0) / 100.0 (+0.0) / 7.0 (-0.1)   |
| Tex load size (GB)                        | 777.5 (+0.0)                                            |
| Tex load bandwidth (GB/s)                 | 22.2 (+0.0) / 132.2 (+0.6) / 355.5 (+0.2) / 37.9 (-0.2) |
| Tex←L2 load hit rate (%)                  | 23.6 (-0.2) / 31.6 (-0.4) / 92.1 (-0.4) / 5.3 (+0.0)    |
| Tex←L2 load size (GB)                     | 283.2 (+0.1)                                            |
| Tex←L2 load bandwidth (GB/s)              | 0.1 (+0.0) / 48.6 (+0.3) / 62.1 (+0.4) / 10.7 (+0.0)    |
| L2 load size (GB)                         | 305.0 (+0.1)                                            |
| Dev. mem. load size (GB)                  | 211.2 (+1.2)                                            |
| Inst. replay overhead (%)                 | 0.8 (+0.0) / 1.7 (+0.0) / 2.1 (+0.0) / 0.2 (+0.0)       |

Table 8: **Geforce GTX 680** - **Hairball - Glass** - Trace kernel profiling data totals for TDFS 0.6, AoS in global memory (top) and in texture memory (bottom).

| San Miguel, GMem, TDFS 0.6, AoS     |                                                          |
|-------------------------------------|----------------------------------------------------------|
| Runtime (ms)                        | 2.3 (-0.2) / 33.3 (-2.1) / 40.2 (-2.2) / 5.1 (-0.4)      |
| Global load transact./req.          | 3.2 (+0.0) / 6.2 (+0.0) / 7.6 (+0.0) / 0.3 (+0.0)        |
| L2 load hit rate (%)                | 63.3 (+0.8) / 71.4 (+0.8) / 94.0 (+0.0) / 1.4 (-0.1)     |
| L2 load size (GB)                   | 186.1 (-0.3)                                             |
| L2 load bandwidth (GB/s)            | 70.8 (+5.3) / 94.5 (+5.6) / 100.9 (+1.1) / 2.6 (+0.0)    |
| L2 load size (GB)                   | 204.7 (-0.3)                                             |
| Dev. mem. load size (GB)            | 69.4 (-1.5)                                              |
| Inst. replay overhead (%)           | 26.6 (+0.1) / 33.8 (-0.7) / 39.2 (-1.1) / 0.9 (-0.1)     |
| **San Miguel, TMem, TDFS 0.6, AoS** |                                                          |
| Runtime (ms)                        | 1.4 (-0.1) / 20.9 (-0.9) / 25.7 (-0.8) / 3.2 (-0.1)      |
| Tex cache hit rate (%)              | 46.2 (+0.1) / 61.0 (+0.1) / 96.0 (+0.0) / 3.3 (-0.1)     |
| Tex load size (GB)                  | 262.3 (+0.0)                                             |
| Tex load bandwidth (GB/s)           | 135.8 (+12.0) / 201.1 (+8.4) / 337.3 (-1.1) / 8.9 (-0.6) |
| Tex←L2 load hit rate (%)            | 43.2 (+0.4) / 46.2 (+0.2) / 85.7 (+0.0) / 1.4 (-0.1)     |
| Tex←L2 load size (GB)               | 99.8 (-0.2)                                              |
| Tex←L2 load bandwidth (GB/s)        | 13.2 (+0.0) / 76.7 (+3.2) / 90.1 (+5.5) / 4.0 (+0.4)     |
| L2 load size (GB)                   | 103.9 (-0.2)                                             |
| Dev. mem. load size (GB)            | 57.8 (-0.4)                                              |
| Inst. replay overhead (%)           | 0.8 (+0.0) / 1.4 (+0.0) / 1.6 (+0.0) / 0.1 (+0.0)        |

Table 9: **Geforce GTX 680** - **San Miguel** - Trace kernel profiling data totals for TDFS 0.6, AoS in global memory (top) and in texture memory (bottom).
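As a rough plausibility check on Tables 6 to 9, the aggregate texture cache hit rate can be estimated from the data volumes alone, since data fetched from L2 corresponds to texture cache misses. The sketch below assumes that a miss fetches exactly as many bytes from L2 as a hit would have delivered from the texture cache, which is a simplification. Because the tables report the average of per-iteration hit rates, the volume-based estimate is expected to be close to, but not identical with, the tabulated average. All names are illustrative.

```python
def implied_hit_rate(requested_gb, fetched_from_below_gb):
    """Hit rate (%) implied by the data requested from a cache and the data it fetched from the next level."""
    return (1.0 - fetched_from_below_gb / requested_gb) * 100.0

# Per scene, texture memory rows of Tables 6-9:
# (Tex load size in GB, Tex<-L2 load size in GB, reported average Tex cache hit rate in %).
tmem_rows = {
    "Crytek Sponza": (111.2, 25.2, 76.5),
    "Kitchen": (217.5, 70.8, 65.6),
    "Hairball - Glass": (777.5, 283.2, 59.8),
    "San Miguel": (262.3, 99.8, 61.0),
}

estimates = {
    scene: implied_hit_rate(tex_gb, l2_gb)
    for scene, (tex_gb, l2_gb, _) in tmem_rows.items()
}

for scene, (_, _, reported) in tmem_rows.items():
    print(f"{scene}: implied {estimates[scene]:.1f}%, reported average {reported:.1f}%")
```

For these four scenes the estimate lands within four percentage points of the tabulated averages.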



Figure 19: **Geforce GTX 680** - **Kitchen** - Trace kernel profiling graph for the best layout in global memory (left, TDFS 0.6, AoS) and texture memory (right, TDFS 0.6, AoS).



Figure 20: **Geforce GTX 680** - **Hairball - Glass** - Trace kernel profiling graph for the best layout in global memory (left, TDFS 0.6, AoS) and texture memory (right, TDFS 0.6, AoS).



Figure 21: **Geforce GTX 680 - San Miguel** - Trace kernel profiling graph for the best layout in global memory (left, TDFS 0.6, AoS) and texture memory (right, TDFS 0.6, AoS).



Figure 22: **Tesla K20c/Geforce GTX 680** - **Crytek Sponza** - Baseline trace kernel profiling graphs for the GTX 680 (left) and Tesla K20c (right). The GTX 680 accesses nodes via texture memory, while the Tesla K20c accesses nodes via the read-only data cache.



Figure 23: **Tesla K20c/Geforce GTX 680** - **Kitchen** - Baseline trace kernel profiling graphs for the GTX 680 (left) and Tesla K20c (right). The GTX 680 accesses nodes via texture memory, while the Tesla K20c accesses nodes via the read-only data cache.



Figure 24: **Tesla K20c/Geforce GTX 680** - **Hairball - Glass** - Baseline trace kernel profiling graphs for the GTX 680 (left) and Tesla K20c (right). The GTX 680 accesses nodes via texture memory, while the Tesla K20c accesses nodes via the read-only data cache.



Figure 25: **Tesla K20c/Geforce GTX 680** - **San Miguel** - Baseline trace kernel profiling graphs for the GTX 680 (left) and Tesla K20c (right). The GTX 680 accesses nodes via texture memory, while the Tesla K20c accesses nodes via the read-only data cache.

# 6 Overall Comparison

An overall observation is that the node layout has the largest impact on performance for both GPU architectures. The AoS layout performed best in both memory areas, except for global memory on Fermi, where SoA32\_24 performed best. Our treelet-based layout achieved the best performance gains for both architectures, though they are only moderate. On average, the common DFS layout performed worst for all node layouts in both memory areas and on both architectures. Excluding layouts that use statistics, the BFS layout, which is equally simple to construct, performed best on average and similar to the TDFS layout.
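The two statistics-free layouts compared above differ only in the order in which nodes are written to memory. As a minimal sketch, assuming a small hypothetical binary tree given as a dictionary of children (the tree and all names are illustrative and not taken from the paper), depth-first and breadth-first orderings can be generated as follows:

```python
from collections import deque

# Hypothetical BVH topology: inner node -> (left child, right child).
# Nodes without an entry are leaves.
children = {
    "root": ("A", "B"),
    "A": ("A0", "A1"),
    "B": ("B0", "B1"),
}

def dfs_layout(root):
    """Pre-order depth-first order: each node is directly followed by its left subtree."""
    order, stack = [], [root]
    while stack:
        node = stack.pop()
        order.append(node)
        # Push the right child first so the left child is popped (and stored) next.
        stack.extend(reversed(children.get(node, ())))
    return order

def bfs_layout(root):
    """Breadth-first order: nodes are stored level by level, with siblings adjacent."""
    order, queue = [], deque([root])
    while queue:
        node = queue.popleft()
        order.append(node)
        queue.extend(children.get(node, ()))
    return order

print("DFS:", dfs_layout("root"))
print("BFS:", bfs_layout("root"))
```

In the breadth-first order both children of a node are adjacent in memory, whereas in the depth-first order the right child is separated from its parent by the entire left subtree.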

# Acknowledgments

The work of S. Widmer and D. Wodniok is supported by the 'Excellence Initiative' of the German Federal and State Governments and the Graduate School of Computational Engineering at Technische Universität Darmstadt. Hairball scene courtesy of Samuli Laine. Crytek Sponza scene courtesy of Frank Meinl. San Miguel scene courtesy of Guillermo M. Leal Llaguno.

# References

- [AL09] Timo Aila and Samuli Laine. Understanding the efficiency of ray traversal on GPUs. In *Proc. HPG*, 2009.
- [ALK12] Timo Aila, Samuli Laine, and Tero Karras. Understanding the efficiency of ray traversal on GPUs – Kepler and Fermi addendum. Technical Report NVR-2012-02, 2012.
- [NVIa] NVIDIA. CUDA C Programming Guide. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html.
- [NVIb] NVIDIA. CUDA Profiling Tools Interface. developer.nvidia.com/cuda-profiling-tools-interface.
- [Sm] Crytek Sponza model. http://www.crytek.com/cryengine/cryengine3/downloads.
- [WPSAM10] H. Wong, M.-M. Papadopoulou, M. Sadooghi-Alvandi, and A. Moshovos. Demystifying GPU microarchitecture through microbenchmarking. In *Proc. ISPASS*, 2010.
- [WSWG13] D. Wodniok, A. Schulz, S. Widmer, and M. Goesele. Analysis of cache behavior and performance of different BVH memory layouts for tracing incoherent rays. In *Proc. EGPGV*, 2013.