Final Report

SUMMARY

I implemented a ray tracing algorithm that uses packet-based traversal techniques and relies of high quality SAH-based bounding volume hierarchies. I also implemented a simple parallel build algorithm for said bounding volume hierarchies.

BACKGROUND

What is Ray Tracing ?

Ray Tracing is a technique for generating an image by tracing the path of light through pixels in an image plane and simulating the effects of its encounters with virtual objects. Ray tracing is capable of simulating a wide variety of optical effects, such as reflection and refraction, but at a higher computational cost than other algorithms like rasterization.

Individual light rays are shot from the viewer (the eye, or the pixel ) into the scene. When a ray hits a surface, a shadow ray is shot from the intersection to test if a surface is visible to a light. If it is, the light contributes to the object’s color, otherwise not.

Since ray tracing follows the laws of physics, very photo-realistic images can be produced with it and there is increasing interest towards making the algorithm and the building of data structures it uses fast enough to be in interactive and real-time applications.

Bounding Volume Hierarchies

Bounding Volume Hierarchies are tree structures used to organize primitives (primitives can be any polygons or spheres, most common are triangles) in a scene. The main objective in using a BVH to organize a scene is to minimize cost of finding closest intersection of ray with a primitive in the scene. Each primitive is enclosed in a an axis aligned bounding box. Interior nodes store subset of primitives and aggregate bounding volumes for all primitives in subtree whereas leaf nodes store a list of primitives.

The root node contains all the primitives and is recursively split into sub trees with a spatial partition chosen to minimize cost, where the cost of each spatial split is the sum of the surface area of the aggregate sub-tree bounding volume of each weighed by the number of primitives it contains respectively.


APPROACH

Building a BVH in Parallel

Since building the BVH is essentially and divide and conquer problem, and each sub-tree is completely independent of others, different threads can work on splitting different sub-trees without any need for synchronization. There is also a significant performance gained by using a iterative stack-based method of building BVH as opposed to recursive calls. However, as illustrated in the diagram below, there is a significant bottleneck in the top levels of the tree. Available threads do not have enough sub-trees to work on.

One solution is to have all the threads work on one node by splitting the primitives amongst the threads as shown in the figure below. Each thread works on computing partial cost of splitting node by each partition plane. Once all the threads have finished this sep, one thread combines costs computed by all threads and finds the minimum cost partition.This does require a barrier but the next step is sequential, but it is small and fast.

This approach works well in the higher levels of the tree when the number of primitives is far greater than number of threads available. Once enough sub-trees are available, I switch to node-level parallelism.

Packet Based Traversal

A ray’s point of intersection with the scene is found by descending nodes of the BVH and discarding all primitives contained in a sub-tree if the ray does not intersect with the bounding box of the node itself. The basic idea in a packet-based traversal method is to traverse the BVH one ‘packet’ at a time, rather than one ‘ray’ at a time. Each packet keeps track of its first active ray. If the first active ray intersects with a node, the entire packet immediately descends into the node. If not, then I find the next active ray in the packet, marking previous ones inactive.

When a packet reaches a leaf, each ray in the packet must be intersected with the primitives in the leaf. I use SIMD instructions to perform this in parallel via Intel’s ISPC. For this, I store a packet of rays as a structure of arrays to be more compatible with SIMD loads and stores. Packet size is a multiple of the vector width being used so as to prevent under-utilization of vector lanes. Tracking the first active ray also helps save SIMD intersection calls on all inactive rays at the beginning of the packet.

However, there are costs associated with using packets as well. Divergence amongst BVH traversal and primitive intersections amongst individual rays in the packet leads to a performance drop for obvious reasons. For example, one ray can drag down an entire packet along with it into a node even if all other rays do not intersect with it. For this reason, it is important for rays belonging to one packet to be as coherent as possible. Then, the maximum benefits of packets are seen. As expected, packet-based traversal performs poorly on over-detailed scenes as primitives are too small for majority rays in a packet to intersect the same ones.

One thing to note is that each ray can be traced completely independently of another and thus ray-tracing is an inherently parallel algorithm. The starter code I used (Fall 2015 15-462's assignment 3) already exploits this fact by including the functionality to have multiple threads working on different ‘tiles’ of an image concurrently. I compare the performance of my packet-based ISPC version of a ray tracer to single ray based one at different thread counts, later, in the results section.

RESULTS

Parallel BVH Build

Speedup compared to Single Threaded Build
Scene Size : ~1M Primitives

>Maximum reached at 6 threads, which the the number of physical cores on the machine the program was run on.

Packet Traversal

This is the speedup achieved compared to the performance of single ray code running with one thread for a medium sized scene that fits in the L2 cache of the machine. Below is the rendered image.


Effect of Packet Size and Vector Width on Performance.

There is a clear advantage to increasing packet size for highly coherent scenes where rays in a packet have similar behaviour with respect to many primitves. On the other hand, AVX provides almost no advantage over SSE-4, possibly due to higher possibility of divergence within a vector-width number of adjacent rays in a packet.

This observation leads to the conclusion that packetizing camera and shadow rays makes sense because they have a much higher chance of being coherent with respect to one another. However, for diffuse bounces, rays are essentially in random directions and hence, indirect lighting computation does not benefit from packetization. In fact, I observed a performance drop when using packets for diffuse bounces compared to single ray code.

REFERENCES

My starting point was the ray tracer I implemented in 15-462 (Fall 2015). I followed suggestions for parallelizing an SAH-BVH build given in Wald’s ‘On fast Construction of SAH-based Bounding Volume Hierarchies’ and ‘Ray Tracing Deformable Scenes Using Dynamic Bounding Volume Hierarchies’ for packet-based ray traversal.