Inria SIMD Blog
Note: This page is mirrored from my page on CGAL's internal wiki, apologies for any incorrect formatting.
Contents
- 1 Introduction
- 2 Project Goals
- 3 Timeline
- 4 Progress
- 4.1 May 1 - May 17
- 4.2 May 18 - June 1
- 4.3 June 2 - June 7
- 4.4 June 8 - June 15
- 4.5 June 16 - June 22
- 4.6 June 23 - June 30
- 4.7 July 1 - July 8
- 4.8 July 9 - July 16
- 4.9 July 17 - July 24
- 4.10 July 25 - August 2
- 4.11 August 3 - August 10
- 4.11.1 Feasibility of a Compact Tree Structure
- 4.11.2 Potential API for an Implicit Tree
- 4.11.3 Preparing to implement Implicit Tree construction
- 4.11.4 The source of N-Way Tree's performance advantage
- 4.11.5 Building a tree using a sort along the Hilbert curve
- 4.11.6 Building a Complete tree with repeated subdivision
- 4.11.7 Better benchmarks for Hilbert-sort construction
- 4.11.8 Improving Hilbert sort construction performance
- 4.11.9 More challenging benchmarks
- 4.12 August 11 - August 17
- 5 Proposed Optimizations
- 6 Queue
1 Introduction
CGAL’s AABB-Tree is an acceleration structure which speeds up common tasks such as collision-detection. It is used both directly and throughout other packages in the library, so any performance improvements made to this package will pay dividends elsewhere. The package has already been a major target of performance optimization, but one approach that hasn’t yet been fully explored is the use of SIMD, a type of parallel computation where the processor operates on several data values at a time. The introduction of SIMD optimizations to the AABB tree has the potential to substantially improve performance by better taking advantage of the capabilities of modern computer processors.
2 Project Goals
2.1 Minimum Viable Product
- Benchmarks which compare the performance of CGAL’s AABB-tree and other implementations with realistic workloads. This includes benchmarks of both intersection and distance query functionality, for a variety of primitives.
- Documentation of targets for improvement in the current implementation, based on profiling with large-scale benchmarks.
- Selection of a SIMD library, and incorporation of SIMD techniques into the AABB-tree.
- Benchmarks which demonstrate an improvement in performance when SIMD is enabled.
2.2 Stretch Goals
- Performance that approaches that of Embree’s Bounding Volume Hierarchy.
- Application of similar optimization techniques to other tree packages in CGAL, including the kD-Tree and the Orthtree, the subject of my Google Summer of Code assignment in 2020.
3 Timeline
3.1 Now - June 7
Community Bonding Period
- Learn about the AABB-tree’s functionality as well as its implementation details.
- Build benchmarks for existing tree structures
- Select a library for SIMD
3.2 June 7 - June 21
- Collect performance data through profiling using Intel’s VTune
- Assemble a table with areas where Embree has the largest performance advantages
[Benchmark Pull Request Merged]
3.3 June 21 - July 7
- Document optimizations used by Embree
- Rank optimizations by expected performance yield, and feasibility with respect to the AABB-tree’s current structure.
3.4 July 7 - July 21
- Apply high-yield optimizations which don’t conflict with AABB-tree’s existing structure.
- Apply any optimizations which are self-contained (for example, ones which only require changes to individual functions)
- Benchmark each change, and use unit tests to confirm that no changes conflict with one another
3.5 July 21 - August 7
- Apply high-yield optimizations which may require more invasive changes to AABB-tree’s implementation
- Benchmark new changes, repeat integration test
3.6 August 7 - August 14
- Apply any remaining optimizations that only require simple changes to the AABB-tree.
- Extend AABB-tree manual to discuss all new optimizations.
[Optimization Pull Request Merged]
4 Progress
4.1 May 1 - May 17
4.1.1 Architectural Analysis
Ayush Saraswat's work during GSoC 2020 demonstrated a massive performance advantage in favor of Embree (over CGAL's AABB-tree). It remains to be seen how much of this advantage comes from SIMD, and how much is architectural.
Pierre Alliez suggested I look at a particular survey of contemporary Bounding Volume Hierarchy design practices. The survey discusses a number of features and optimizations which trees might include. Characterizing both trees with regard to this survey was a good way of doing a motivated tour of the code-bases.
Feature | CGAL | Embree | Description |
---|---|---|---|
Construction Strategy | Top-down | Top-down | Is the tree built root-first (then split), or leaves first (then combined)? |
Incremental Construction | Yes | No | Can primitives be added to the tree without completely reconstructing it? |
Linear BVH | No | Yes (Morton codes) | Can the tree be built faster by first sorting primitives along a space-filling curve? |
Topology Optimization | No? | Yes (Binned SAH) | Can the tree be refined so that it's optimal by some measure, for example minimal total surface area of bounding boxes? |
Subtree Collapsing | No | No? | Can leaf nodes contain more than N primitives? |
Data Layout Approach | Naive? | Preallocated, aligned | How is the tree explicitly arranged in memory to reduce cache misses during common operations? |
Hardware Acceleration | Multithreading | Multithreading, SIMD | What hardware features does the tree take advantage of? |
Spatial Splits | No | Yes | Can primitives be broken into multiple parts along bbox boundaries? |
Wide BVH | No | Yes (N-way splits) | Can the space be partitioned in more than two ways at each node? |
Non-Polygonal Objects | No | Yes | Does the tree provide optimized support for different primitive types? |
Compact Representation | Lower-precision bbox | Naive? | What efforts does the tree make to avoid taking up too much memory? |
First-hit Traversal | Yes | Yes | Can you use the tree to find the first primitive a ray intersects with? |
Any-hit Traversal | Yes | Yes | Can you use the tree to find out whether or not a ray intersects with any primitives? |
Multi-hit Traversal | Yes | No | Can you use the tree to find the list of primitives a ray intersects with? |
Stream Traversal | No? | Yes | Does the tree provide accelerated support for intersections with groups of similar rays? |
Stackless Traversal | No | Yes? (Skip-links) | Is traversal done non-recursively, instead following a ray directly between adjacent nodes? |
Ray-locality Opt. | No | No | Can you traverse the tree without loading faraway nodes into memory? |
CGAL's AABB-tree seems a lot more "vanilla" compared to Embree, and I think that's because a lot of Embree's optimizations are incompatible with CGAL's high level of genericity. For example, it might be impossible to generate Morton codes for a lot of CGAL's kernels, and explicit data layout might not be useful for variable-precision types. CGAL naturally can't provide a lot of the same guarantees about the data being worked with, and that excludes certain optimizations.
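For context on the "Linear BVH" row above: a Morton code interleaves the bits of quantized coordinates, so sorting primitives by it arranges them along a Z-order space-filling curve. A minimal sketch (my own illustration, not CGAL or Embree code; the magic constants are the standard 10-bit-per-axis expansion):

```cpp
#include <cassert>
#include <cstdint>

// Spread the lowest 10 bits of v so each bit is followed by two zero bits.
std::uint32_t expand_bits(std::uint32_t v) {
    v &= 0x3ffu;
    v = (v * 0x00010001u) & 0xFF0000FFu;
    v = (v * 0x00000101u) & 0x0F00F00Fu;
    v = (v * 0x00000011u) & 0xC30C30C3u;
    v = (v * 0x00000005u) & 0x49249249u;
    return v;
}

// Interleave three 10-bit coordinates into one 30-bit Morton code; sorting
// primitives by this key orders them along a Z-order curve, which is what
// makes fast "linear BVH" construction possible.
std::uint32_t morton3(std::uint32_t x, std::uint32_t y, std::uint32_t z) {
    return (expand_bits(x) << 2) | (expand_bits(y) << 1) | expand_bits(z);
}
```

Note that this assumes coordinates already quantized to a fixed 10-bit grid, which is exactly the guarantee a generic kernel may not be able to give.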
I think Embree's Wide BVH implementation is a good example of how it takes advantage of SIMD. Each node of the tree has bounds defined using a custom vector type of size N (e.g. `vfloat<N> lower_x`). When N is no greater than the number of SIMD lanes, bounds checking for all children can be done in a similar number of instructions to a conventional (two-child) tree! Notably, the underlying data structure of the `vfloat<N>` type is a simple C-style array. As I understand it, the team that develops Embree works closely with the Intel compiler team to make sure that their code can be compiled to packed SIMD instructions, rather than using intrinsics explicitly.
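To make that concrete, here's a rough sketch of the wide-node idea (my own illustration, not Embree's actual layout or code). The bounds are stored "structure of arrays" style, so the slab test runs over all four children in fixed-length, branch-free loops of exactly the shape autovectorizers like:

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical 4-wide BVH node: one array entry per child.
struct WideNode4 {
    float lower_x[4], lower_y[4], lower_z[4];
    float upper_x[4], upper_y[4], upper_z[4];
};

// Slab-test one ray against all four children at once. Assumes a precomputed
// inverse direction with no zero components. Returns a bitmask of hit children.
int hit_mask(const WideNode4 &n, const float org[3], const float inv_dir[3],
             float tmin, float tmax) {
    int mask = 0;
    for (int i = 0; i < 4; ++i) {  // conceptually, one SIMD lane per child
        float tx0 = (n.lower_x[i] - org[0]) * inv_dir[0];
        float tx1 = (n.upper_x[i] - org[0]) * inv_dir[0];
        float ty0 = (n.lower_y[i] - org[1]) * inv_dir[1];
        float ty1 = (n.upper_y[i] - org[1]) * inv_dir[1];
        float tz0 = (n.lower_z[i] - org[2]) * inv_dir[2];
        float tz1 = (n.upper_z[i] - org[2]) * inv_dir[2];
        float lo = std::max({std::min(tx0, tx1), std::min(ty0, ty1),
                             std::min(tz0, tz1), tmin});
        float hi = std::min({std::max(tx0, tx1), std::max(ty0, ty1),
                             std::max(tz0, tz1), tmax});
        if (lo <= hi) mask |= 1 << i;
    }
    return mask;
}
```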
I was hoping Ayush Saraswat's benchmarks would be indicative of the performance gain SIMD can offer, but it seems like the two trees might just be too different to compare directly like that. Embree has some advantage because it doesn't need to guarantee exactness, and some because of its more elaborate architecture. In order to determine where SIMD is making the biggest difference, it's necessary to either quantify or eliminate these other advantages.
4.1.2 Prevalence of SIMD Instructions in Existing Code
Before attempting to introduce SIMD, an important question is how much the existing code already takes advantage of vectorized instructions. Modern compilers are very adept at vectorizing code that may not appear to use SIMD, by unrolling loops or recognizing blocks of code with implicitly parallel behavior.
The laptop I'm working on has an Intel i7-7700HQ CPU, which supports instruction sets up to SSE 4.2 and AVX 2. Depending on what compiler flags are set in CGAL's default release config, this means that the emitted assembly should at least contain some SSE instructions, and perhaps AVX instructions too (though in my experience, AVX must be explicitly enabled).
To determine this, I compiled the AABB-tree's Ray-shooting example in release mode, with no modifications to CGAL's default compiler flags. I counted packed SIMD instructions with the help of awk (piped to wc), a modification of the technique discussed here.
Instruction Set | # of Instructions | % of Program |
---|---|---|
All | 23,256 | 100% |
SSE | 446 | 1.917% |
SSE2 | 401 | 1.724% |
SSE3 | 0 | 0.00% |
SSSE3 | 0 | 0.00% |
SSE4 | 0 | 0.00% |
AVX | 0 | 0.00% |
To my surprise, the program contains no SIMD instructions beyond SSE2. It remains to be seen what the cause of this is, but this is actually very good news for this project: because the existing code isn't already taking full advantage of the CPU, I have a lot of performance headroom!
Further examination shows that CGAL's default release config for CMake on Fedora does not enable `-march=native`. This is because it's intended to be distributed as a binary (with broad hardware compatibility). Because CGAL is a header-only library, users don't need to accept the same compromise: they can build with whatever flags they like.
4.2 May 18 - June 1
4.2.1 Examining Ray-BBox Intersection
According to included comments, CGAL's ray-bbox intersection algorithm is inspired by this paper. On Pierre's suggestion, I'm exploring how SIMD could be used to accelerate this operation.
NOTE: Technically this algorithm performs intersections between line segments and bounding boxes, but it generalizes to infinite rays. Both the paper and CGAL refer to it as Ray-BBox intersection, so I will remain consistent with that naming.
First, I read through the discussed algorithm, and traced its operation on paper for a very simple example case.
Next, I put together a simplified version of the algorithm in the paper. It is somewhat closer to pseudocode than the paper's listing, and it's written more verbosely, to help show exactly what's happening.
```cpp
struct BBox { Vector3 bounds[2]; };

struct Ray {
    Vector3 origin, direction, inv_direction;
    int sign[3]; // I'd like a more semantically appropriate way of representing this!
};

bool intersects(const BBox &box, const Ray &ray, float rmin, float rmax) {

    // Determine bounds for x and y
    float xmin = (box.bounds[ray.sign[0]].x - ray.origin.x) * ray.inv_direction.x;
    float xmax = (box.bounds[1 - ray.sign[0]].x - ray.origin.x) * ray.inv_direction.x;
    float ymin = (box.bounds[ray.sign[1]].y - ray.origin.y) * ray.inv_direction.y;
    float ymax = (box.bounds[1 - ray.sign[1]].y - ray.origin.y) * ray.inv_direction.y;

    // If the x and y bounds don't overlap, the ray doesn't intersect with the box
    if (xmin > ymax || ymin > xmax) return false;

    // Determine the bounds of the overlapping region
    float min = MAX(xmin, ymin);
    float max = MIN(xmax, ymax);

    // Determine bounds for z
    float zmin = (box.bounds[ray.sign[2]].z - ray.origin.z) * ray.inv_direction.z;
    float zmax = (box.bounds[1 - ray.sign[2]].z - ray.origin.z) * ray.inv_direction.z;

    // If the z bounds don't overlap with the existing region, the ray doesn't intersect
    if (zmin > max || min > zmax) return false;

    // Update the bounds to find the region where x, y, and z overlap
    min = MAX(min, zmin);
    max = MIN(max, zmax);

    // The ray intersects if this region overlaps with the bounds provided
    return (min < rmax) && (max > rmin);
}
```
Once I had a starting point, it was possible to start thinking about how the code could be vectorized. Intel's official guide was a useful starting point when reasoning about what autovectorizer-friendly code looks like.
A more SIMD-optimal version might look something like this:
(Assignments that are grouped together are those that I think will be vectorized, e.g. all mins will be calculated as a group)
```cpp
bool intersects(const BBox &box, const Ray &ray, float rmin, float rmax) {

    // Determine minimum bounds for x, y, and z
    float xmin = (box.bounds[ray.sign[0]].x - ray.origin.x) * ray.inv_direction.x;
    float ymin = (box.bounds[ray.sign[1]].y - ray.origin.y) * ray.inv_direction.y;
    float zmin = (box.bounds[ray.sign[2]].z - ray.origin.z) * ray.inv_direction.z;

    // Determine maximum bounds for x, y, and z
    float xmax = (box.bounds[1 - ray.sign[0]].x - ray.origin.x) * ray.inv_direction.x;
    float ymax = (box.bounds[1 - ray.sign[1]].y - ray.origin.y) * ray.inv_direction.y;
    float zmax = (box.bounds[1 - ray.sign[2]].z - ray.origin.z) * ray.inv_direction.z;

    // Determine the minimum bound of the region where x, y, and z overlap
    float min = MAX(xmin, ymin, zmin);

    // Determine the maximum bound of the region where x, y, and z overlap
    float max = MIN(xmax, ymax, zmax);

    // The ray intersects if this region exists and overlaps with the bounds provided
    return (max > min) && (min < rmax) && (max > rmin);
}
```
Notice how we lose our early-exits: this function is now branch-free, for better or worse. This enables wider SIMD operations (finding all three mins at once rather than finding z separately), but it may mean that this has worse best-case performance.
The compiler may pad the vector values so that they contain four elements rather than three. If it's smart enough, this can enable the use of appropriate instructions to rapidly find the min and max values.
This code still finds the min and max values separately, but the compiler likely splits things up more to enable even wider SIMD operations. The following code might compile to exactly the same thing; because it's only a change to the order of operations, the compiler should be smart enough to make this optimization automatically:
(In this code I also explicitly acknowledge the padding values I might expect the vectorizer to add)
```cpp
bool intersects(const BBox &box, const Ray &ray, float rmin, float rmax) {

    // Gather intermediate values for the minimum bounds
    float xmin = box.bounds[ray.sign[0]].x;
    float ymin = box.bounds[ray.sign[1]].y;
    float zmin = box.bounds[ray.sign[2]].z;
    /* float pad_min = FLOAT_MAX; */

    // Gather intermediate values for the maximum bounds
    float xmax = box.bounds[1 - ray.sign[0]].x;
    float ymax = box.bounds[1 - ray.sign[1]].y;
    float zmax = box.bounds[1 - ray.sign[2]].z;
    /* float pad_max = FLOAT_MIN; */

    // Apply the transform to all bounds
    xmin = (xmin - ray.origin.x) * ray.inv_direction.x;
    ymin = (ymin - ray.origin.y) * ray.inv_direction.y;
    zmin = (zmin - ray.origin.z) * ray.inv_direction.z;
    /* pad_min = (pad_min - 0.0) * 1.0; */
    xmax = (xmax - ray.origin.x) * ray.inv_direction.x;
    ymax = (ymax - ray.origin.y) * ray.inv_direction.y;
    zmax = (zmax - ray.origin.z) * ray.inv_direction.z;
    /* pad_max = (pad_max - 0.0) * 1.0; */

    // Determine the minimum bound of the region where x, y, and z overlap
    float min = MAX(xmin, ymin, zmin /*, pad_min */);

    // Determine the maximum bound of the region where x, y, and z overlap
    float max = MIN(xmax, ymax, zmax /*, pad_max */);

    // The ray intersects if this region exists and overlaps with the bounds provided
    return (max > min) && (min < rmax) && (max > rmin);
}
```
This arrangement is mathematically equivalent, and the compiler may choose to do something like this if it determines that it results in better performance (though I'm not certain that's the case!).
Note: These comments are primarily speculative, based on my understanding of Intel's documentation.
4.2.2 Choosing a SIMD Approach
Note: ISPC is discussed in this fantastic blog
From most readable to most explicit, the candidate approaches are:

- Compiler autovectorization
- OpenMP (`#pragma omp simd`)
- High level libraries (e.g. Eigen)
- Low level libraries (wrappers over intrinsics)
- SIMD intrinsics
- Embedded assembly
4.2.2.1 Libraries
Library | License | Notes |
---|---|---|
Boost.SIMD | Closed | Originally planned for inclusion in Boost; sadly that was cancelled and the library was made closed source. |
nsimd | MIT | Fork of Boost.SIMD |
xsimd | BSD 3-clause | Reimplementation of Boost.SIMD's API |
Vectorclass | GPLv3 | Library by Agner Fog |
libsimdpp | Boost | Provides Dynamic Dispatch |
std-simd | BSD 3-clause | `std::experimental::simd`, might someday be included in libstdc++! |
dimsum | Apache 2.0 | Google's own implementation of the same standard simd proposal. |
SIMD Everywhere | MIT | Enables the use of any flavor of intrinsics, on any platform (automatically substitutes unsupported intrinsics) |
In my opinion, the list can be reduced to the following libraries, each with its own advantages:
- xsimd: the most actively developed
- libsimdpp: might be the only library with macros to help with dynamic dispatch
- std-simd: the API follows the ISO `std::experimental::simd` specification, and it may be included in a future version of C++
- SIMD Everywhere: very different from the rest; it can add cross-platform support to Marc Glisse's existing code (but provides no convenience layer over intrinsics)
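For a sense of what the lowest practical rung of the ladder looks like (and what all of the wrapper libraries abstract over), here are raw SSE intrinsics for the kind of min/max operations a slab test needs. This is a generic illustration, not code from any of the libraries listed:

```cpp
#include <cassert>
#include <xmmintrin.h>  // SSE intrinsics

// Component-wise min of four floats at a time: a single packed MINPS instruction.
void min4(const float a[4], const float b[4], float out[4]) {
    _mm_storeu_ps(out, _mm_min_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}

// Component-wise max of four floats at a time: a single packed MAXPS instruction.
void max4(const float a[4], const float b[4], float out[4]) {
    _mm_storeu_ps(out, _mm_max_ps(_mm_loadu_ps(a), _mm_loadu_ps(b)));
}
```

Writing a whole tree traversal at this level is verbose and non-portable, which is exactly the ergonomic gap the libraries above try to fill.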
4.2.3 Automated SIMD Prevalence Analysis
The earlier process of determining how many SIMD instructions were present in a binary file was relatively time-consuming. Because this is something I expect to do frequently in the course of the project, I decided to automate it. I created a simple bash script which counts the instructions as I did manually, and also calculates their percentages, producing a table like the one shown earlier.
This work is available on a new repository, which I intend to use for future such experimentation.
4.3 June 2 - June 7
4.3.1 Ray-BBox Intersection Codes
In order to see how different compiler flags, optimizations, and SIMD libraries affect the produced binary and its performance, I'm using a minimal test case. I assembled a collection of ray-bbox intersection tests for spartan implementations of Vector3, BBox, and Ray types. These are inspired by the paper discussed earlier, with various modifications added.
**Smits' Method**

Included for completeness' sake, this is a relatively naive approach discussed in the paper. The ray's inverse direction and sign are not pre-computed, so this should be expected to under-perform all other methods.

```cpp
bool intersect_smits_method(const BBox &bbox, const Ray &ray, float t0, float t1) {
    double tmin, tmax, tymin, tymax, tzmin, tzmax;
    if (ray.direction().x() >= 0) {
        tmin = (bbox.bounds()[0].x() - ray.origin().x()) / ray.direction().x();
        tmax = (bbox.bounds()[1].x() - ray.origin().x()) / ray.direction().x();
    } else {
        tmin = (bbox.bounds()[1].x() - ray.origin().x()) / ray.direction().x();
        tmax = (bbox.bounds()[0].x() - ray.origin().x()) / ray.direction().x();
    }
    if (ray.direction().y() >= 0) {
        tymin = (bbox.bounds()[0].y() - ray.origin().y()) / ray.direction().y();
        tymax = (bbox.bounds()[1].y() - ray.origin().y()) / ray.direction().y();
    } else {
        tymin = (bbox.bounds()[1].y() - ray.origin().y()) / ray.direction().y();
        tymax = (bbox.bounds()[0].y() - ray.origin().y()) / ray.direction().y();
    }
    if ((tmin > tymax) || (tymin > tmax)) return false;
    if (tymin > tmin) tmin = tymin;
    if (tymax < tmax) tmax = tymax;
    if (ray.direction().z() >= 0) {
        tzmin = (bbox.bounds()[0].z() - ray.origin().z()) / ray.direction().z();
        tzmax = (bbox.bounds()[1].z() - ray.origin().z()) / ray.direction().z();
    } else {
        tzmin = (bbox.bounds()[1].z() - ray.origin().z()) / ray.direction().z();
        tzmax = (bbox.bounds()[0].z() - ray.origin().z()) / ray.direction().z();
    }
    if ((tmin > tzmax) || (tzmin > tmax)) return false;
    if (tzmin > tmin) tmin = tzmin;
    if (tzmax < tmax) tmax = tzmax;
    return (tmin < t1) && (tmax > t0);
}
```

**Improved**

This is the modified version of Smits' method proposed by the paper, which uses the pre-computed values for better performance. This approach will serve as the overall baseline.

```cpp
bool intersect_improved(const BBox &bbox, const Ray &ray, float t0, float t1) {
    double tmin, tmax, tymin, tymax, tzmin, tzmax;
    tmin = (bbox.bounds()[ray.sign()[0]].x() - ray.origin().x()) * ray.inv_direction().x();
    tmax = (bbox.bounds()[1 - ray.sign()[0]].x() - ray.origin().x()) * ray.inv_direction().x();
    tymin = (bbox.bounds()[ray.sign()[1]].y() - ray.origin().y()) * ray.inv_direction().y();
    tymax = (bbox.bounds()[1 - ray.sign()[1]].y() - ray.origin().y()) * ray.inv_direction().y();
    if ((tmin > tymax) || (tymin > tmax)) return false;
    if (tymin > tmin) tmin = tymin;
    if (tymax < tmax) tmax = tymax;
    tzmin = (bbox.bounds()[ray.sign()[2]].z() - ray.origin().z()) * ray.inv_direction().z();
    tzmax = (bbox.bounds()[1 - ray.sign()[2]].z() - ray.origin().z()) * ray.inv_direction().z();
    if ((tmin > tzmax) || (tzmin > tmax)) return false;
    if (tzmin > tmin) tmin = tzmin;
    if (tzmax < tmax) tmax = tzmax;
    return (tmin < t1) && (tmax > t0);
}
```

**Clarified**

This is my own interpretation of the paper's approach, modified for clarity. I expect the use of semantic constructs like `std::min()` and `std::max()` to come at no cost to performance.

```cpp
bool intersect_clarified(const BBox &bbox, const Ray &ray, float rmin, float rmax) {

    // Determine bounds for x and y
    double xmin = (bbox.bounds()[ray.sign()[0]].x() - ray.origin().x()) * ray.inv_direction().x();
    double xmax = (bbox.bounds()[1 - ray.sign()[0]].x() - ray.origin().x()) * ray.inv_direction().x();
    double ymin = (bbox.bounds()[ray.sign()[1]].y() - ray.origin().y()) * ray.inv_direction().y();
    double ymax = (bbox.bounds()[1 - ray.sign()[1]].y() - ray.origin().y()) * ray.inv_direction().y();

    // If the x and y bounds don't overlap, the ray doesn't intersect with the box
    if (xmin > ymax || ymin > xmax) return false;

    // Determine the bounds of the overlapping region
    double min = std::max(xmin, ymin);
    double max = std::min(xmax, ymax);

    // Determine bounds for z
    double zmin = (bbox.bounds()[ray.sign()[2]].z() - ray.origin().z()) * ray.inv_direction().z();
    double zmax = (bbox.bounds()[1 - ray.sign()[2]].z() - ray.origin().z()) * ray.inv_direction().z();

    // If the z bounds don't overlap with the existing region, the ray doesn't intersect
    if (min > zmax || zmin > max) return false;

    // Update the bounds to find the region where x, y, and z overlap
    min = std::max(min, zmin);
    max = std::min(max, zmax);

    // The ray intersects if this region overlaps with the bounds provided
    return (min < rmax) && (max > rmin);
}
```

**Branchless**

This is the first major modification aiming to improve SIMD performance. Removing the conditionals gives us straight-line code, but eliminates opportunities for early exits. This code should be more readily auto-vectorized by the compiler, but only benchmarks can tell whether that advantage is enough to overcome the lack of early exits in real-world scenarios.

```cpp
bool intersect_branchless(const BBox &bbox, const Ray &ray, float rmin, float rmax) {

    // Determine bounds for x, y, and z
    double xmin = (bbox.bounds()[ray.sign()[0]].x() - ray.origin().x()) * ray.inv_direction().x();
    double xmax = (bbox.bounds()[1 - ray.sign()[0]].x() - ray.origin().x()) * ray.inv_direction().x();
    double ymin = (bbox.bounds()[ray.sign()[1]].y() - ray.origin().y()) * ray.inv_direction().y();
    double ymax = (bbox.bounds()[1 - ray.sign()[1]].y() - ray.origin().y()) * ray.inv_direction().y();
    double zmin = (bbox.bounds()[ray.sign()[2]].z() - ray.origin().z()) * ray.inv_direction().z();
    double zmax = (bbox.bounds()[1 - ray.sign()[2]].z() - ray.origin().z()) * ray.inv_direction().z();

    // Determine the bounds of the overlapping region
    double min = std::max({xmin, ymin, zmin});
    double max = std::min({xmax, ymax, zmax});

    // The ray intersects if this region overlaps with the interval provided
    return (max > min) && (min < rmax) && (max > rmin);
}
```

As I add more implementations, they will be added to this list.
4.3.2 SIMD Prevalence for Different Implementations
For analysis, the prevalence test was restricted to only the relevant function (otherwise the testing framework and its use of the standard library can overwhelm the relevant statistics).
This proved to be the right choice: for example, the Smits' method program contained 1,957 instructions, but only 67 of those make up the intersection function itself!
4.3.2.1 Smits' Method
Instruction Set | Count | Percent |
---|---|---|
All | 67 | 100.00 |
SSE | 1 | 1.49 |
SSE (Packed) | 1 | 1.49 |
SSE2 | 32 | 47.76 |
SSE2 (Packed) | 5 | 7.46 |
SSE3 | 0 | 0 |
SSSE3 | 0 | 0 |
SSE4 | 0 | 0 |
AVX | 0 | 0 |
This already contains a lot of SIMD instructions, but none above SSE2. Note that non-packed SIMD instructions operate on single values, and not on the entire SSE register. This means that they don't actually introduce parallelism to the code in the way packed instructions do. The next table shows the same function compiled with `-march=native`:
Instruction Set | Count | Percent |
---|---|---|
All | 65 | 100.00 |
SSE | 0 | 0 |
SSE (Packed) | 0 | 0 |
SSE2 | 0 | 0 |
SSE2 (Packed) | 0 | 0 |
SSE3 | 0 | 0 |
SSSE3 | 0 | 0 |
SSE4 | 0 | 0 |
AVX | 7 | 10.76 |
Once the compiler is allowed to use newer instructions, it replaces all the SSE and SSE2 instructions with AVX. Though this looks like lower use of SIMD, only 6 of the SSE instructions were packed, while all 7 AVX instructions are packed. The non-packed SSE instructions are only relevant when data is placed in SSE registers. With AVX the same tasks are accomplished with conventional instructions.
Note that in general, it wouldn't be unreasonable to see fewer total SIMD instructions when AVX is enabled. Because modern AVX instructions can operate on larger sets of values at a time, they should be able to accomplish the same tasks using fewer instructions.
4.3.2.2 Improved
Instruction Set | Count | Percent |
---|---|---|
All | 71 | 100.00 |
SSE | 1 | 1.40 |
SSE (Packed) | 1 | 1.40 |
SSE2 | 26 | 36.61 |
SSE2 (Packed) | 2 | 2.81 |
SSE3 | 0 | 0 |
SSSE3 | 0 | 0 |
SSE4 | 0 | 0 |
AVX | 0 | 0 |
The improved version actually contains more instructions, despite its use of precomputed values. It also contains fewer packed SIMD instructions, perhaps due to additional logic precluding their use in certain cases.
When analyzing these results, it's important to be aware of two things:
- The improved version accounts for an edge case not caught by the original.
- More instructions does not always mean slower code (not all instructions are created equal).

The next table shows the same function compiled with `-march=native`:
Instruction Set | Count | Percent |
---|---|---|
All | 69 | 100.00 |
SSE | 0 | 0 |
SSE (Packed) | 0 | 0 |
SSE2 | 0 | 0 |
SSE2 (Packed) | 0 | 0 |
SSE3 | 0 | 0 |
SSSE3 | 0 | 0 |
SSE4 | 0 | 0 |
AVX | 12 | 17.39 |
Enabling AVX seems to help the improved version. Though it still has more instructions than the original, more of those instructions are SIMD.
Benchmarks will tell what effect this has on performance.
4.3.2.3 Clarified
Instruction Set | Count | Percent |
---|---|---|
All | 67 | 100.00 |
SSE | 1 | 1.49 |
SSE (Packed) | 1 | 1.49 |
SSE2 | 24 | 35.82 |
SSE2 (Packed) | 0 | 0 |
SSE3 | 0 | 0 |
SSSE3 | 0 | 0 |
SSE4 | 0 | 0 |
AVX | 0 | 0 |
This result is actually unexpected: my version of the improved algorithm was tweaked only for readability, and I expected it to result in identical assembly being emitted. The new code compiles to fewer instructions, but it also makes less use of SIMD. In order to see where the differences lie, I'll need to take a closer look at the assembly. My first guess is that the difference comes from the additional variables, or from my use of `std::max` and `std::min`. The next table shows the same function compiled with `-march=native`:
Instruction Set | Count | Percent |
---|---|---|
All | 69 | 100.00 |
SSE | 0 | 0 |
SSE (Packed) | 0 | 0 |
SSE2 | 0 | 0 |
SSE2 (Packed) | 0 | 0 |
SSE3 | 0 | 0 |
SSSE3 | 0 | 0 |
SSE4 | 0 | 0 |
AVX | 12 | 17.39 |
With AVX instructions enabled, things look more like I expected: the function appears to compile to the same assembly as its more verbose predecessor. What remains a mystery is why enabling AVX would cause this; as I understand it, the `-march=native` flag only makes new instructions available to the compiler, it shouldn't add any new paths for optimization!
4.3.2.4 Branchless
Instruction Set | Count | Percent |
---|---|---|
All | 64 | 100.00 |
SSE | 1 | 1.56 |
SSE (Packed) | 1 | 1.56 |
SSE2 | 21 | 32.81 |
SSE2 (Packed) | 0 | 0 |
SSE3 | 0 | 0 |
SSSE3 | 0 | 0 |
SSE4 | 0 | 0 |
AVX | 0 | 0 |
Similar to the clarified code, this function contains only a single packed SSE instruction when compiled without `-march=native`! The only meaningful change was to remove the conditionals (none of the math was altered); my hope was that this would unlock more optimizations by the autovectorizer, but clearly that's not the case. There are fewer total instructions, but that's to be expected as a result of the reduction in logic. This doesn't necessarily mean that there's no advantage to the branch-free approach, but any measured advantage would likely be due to reduced branch mispredictions, and not to increased SIMD utilization. The next table shows the results with `-march=native`:
Instruction Set | Count | Percent |
---|---|---|
All | 64 | 100.00 |
SSE | 0 | 0 |
SSE (Packed) | 0 | 0 |
SSE2 | 0 | 0 |
SSE2 (Packed) | 0 | 0 |
SSE3 | 0 | 0 |
SSSE3 | 0 | 0 |
SSE4 | 0 | 0 |
AVX | 12 | 18.75 |
Based on the previous table, these results aren't surprising: the branchless function compiles to very similar assembly to the other two equivalent functions, minus some conditional logic!
4.4 June 8 - June 15
4.4.1 Collecting Test Data for Benchmarking
Andreas, Pierre, and I discussed the challenges of producing realistic test data when evaluating the performance of an operation like the intersection function. They proposed extracting scenarios from the tetrahedral remeshing tests, because that's an example of a real package which heavily uses the AABB-tree to shoot rays. This is actually relatively simple to do, by inserting a print statement like the following in CGAL's ray-bbox intersection code:
```cpp
if (std::is_same<FT, double>::value)
    std::cout << px << " " << py << " " << pz << " "
              << qx << " " << qy << " " << qz << " "
              << bxmin << " " << bymin << " " << bzmin << " "
              << bxmax << " " << bymax << " " << bzmax << "\n";
```
(Only double types are saved, because those are the ones relevant to the benchmark.)
This is an easy and effective way to capture the real ways that the function is used.
Running `test_mesh_and_remesh_polyhedral_domain_with_features.cpp` saved 3,742,217 examples, producing a 376.5 MB text file! The format isn't particularly pretty or readable, but it's usable enough for parsing back into my simplified Ray and BBox types.

You may call `std::cout.precision(17)` in case that was not called earlier. Andreas Fabri 08:21, 10 June 2021 (CEST)

- Thanks for the advice, enabling full precision data produced a slightly different intersection frequency! Jackson Campolattaro
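Reading the dump back is a short loop; a sketch of the parser (using a hypothetical flat `Scenario` struct standing in for my simplified Ray and BBox types):

```cpp
#include <cassert>
#include <istream>
#include <sstream>
#include <vector>

// One captured ray-bbox test case, matching the 12 numbers printed per line.
struct Scenario {
    double px, py, pz, qx, qy, qz;                    // segment source and target
    double bxmin, bymin, bzmin, bxmax, bymax, bzmax;  // bounding box
};

// Each line of the dump is 12 whitespace-separated doubles; stream extraction
// handles the separators and stops cleanly at end of input.
std::vector<Scenario> parse_scenarios(std::istream &in) {
    std::vector<Scenario> scenarios;
    Scenario s;
    while (in >> s.px >> s.py >> s.pz >> s.qx >> s.qy >> s.qz
              >> s.bxmin >> s.bymin >> s.bzmin >> s.bxmax >> s.bymax >> s.bzmax)
        scenarios.push_back(s);
    return scenarios;
}
```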
4.4.2 Initial Benchmark Results
I created a simple benchmarking program with a function for timing lambda functions using `std::chrono`. Because I'm running this on a laptop, it was important to ensure that later tests wouldn't be at a disadvantage because of rising CPU temperatures. To accomplish this, I interleaved the runs of the different functions: I time the process of calculating N scenarios using each implementation, repeat that R times, and average the results. I found that even N=3742217 (my entire dataset) could be processed in a reasonable amount of time. To reduce time variance, I set R=100; this gave me a total runtime slightly over 30s.
(Unless otherwise indicated, times correspond to a single run through all scenarios)
Implementations | -O3 | -O3 -march=native | Average |
---|---|---|---|
Smits' Method | 5.12781e+07 ns | 5.07237e+07 ns | 5.10E+07 ns |
Improved | 4.50296e+07 ns | 4.34125e+07 ns | 4.42E+07 ns |
Clarified | 4.53892e+07 ns | 4.30072e+07 ns | 4.42E+07 ns |
Branchless | 4.74904e+07 ns | 4.5071e+07 ns | 4.63E+07 ns |
Average | 4.73E+07 ns | 4.56E+07 ns | 4.64E+07 ns |
The averages along the bottom are useful for seeing how compiler flags broadly affect performance, while the averages on the right side are useful for an idea of how each implementation performs, irrespective of flags.
These benchmark results tell us a few things:
- -march=native confers a 3-5% improvement in performance (for this task).
- The clarified code has similar performance to the (functionally equivalent) improved code, despite its different assembly.
- The branchless code slightly underperforms the improved and clarified versions, indicating that the early exits that those provide are worthwhile for real data. (The fact that this is so close means that I could have easily gotten the wrong impression, had I used the wrong type of synthetic data!)
Based on lessons learned here, I intend to base future implementations off of the clarified version.
4.4.3 Tweaks to Benchmarks
During our Friday meeting, we discussed ways to open up more opportunities for SIMD parallelism in the benchmarks. One idea was to take into account the fact that each ray is tested against many bounding boxes. Previously, our input data contained scenarios that matched one ray against one box; I changed this so that each unique ray is paired with the collection of boxes that it's tested against. Putting the boxes in groups like this may make it possible to perform vectorizations on a larger scale, looking not just at the relationship between one ray and one box, but at how one ray relates to a large set of boxes.
This doesn't map as closely to existing real-life use of the intersection function (bounding boxes are not guaranteed to be adjacent in memory), but it does do a good job of reflecting an ideal scenario, which we should strive towards in order to make the best use of SIMD.
Implementations | -O3 | -O3 -march=native | Average |
---|---|---|---|
Smits' Method | 4.96E+07 ms | 5.06E+07 ms | 5.01E+07 ms |
Improved | 2.85E+07 ms | 2.62E+07 ms | 2.73E+07 ms |
Clarified | 2.76E+07 ms | 2.65E+07 ms | 2.71E+07 ms |
Branchless | 3.28E+07 ms | 3.28E+07 ms | 3.28E+07 ms |
Average | 3.46E+07 ms | 3.40E+07 ms | 3.43E+07 ms |
This simple change produced a massive improvement in performance!
Across the board, times are cut by more than 25%.
This is a rather spectacular difference, considering that no change was made to the functions
themselves.
4.4.4 Intersection using XSimd
I added another intersection test that uses XSimd to implement the
branchless version with more explicit use of SIMD.
I chose XSimd because its usage semantics are better than std-simd's, but I
was disappointed to find that its types don't generalize as nicely across
platforms.
For example, the xs::batch<double, 4>
type is only available when -march=native
is used, and only on systems that provide registers of that size.
Unlike std-simd, no fallback is provided in other cases (the code will simply fail to compile).
The code works properly, but doesn't perform well. For the standard benchmark, its time averages around 3.99e+07 ms; better than the naive version, but worse than all the others. It has the same number of AVX instructions as the auto-vectorized branchless version, but it also has a lot more instructions in general.
It's certainly possible to beat the compiler by using a library like this, but I get the impression that I don't (yet) have the skill to pull that off.
XSimd in particular seems poorly suited to this type of fine-grained optimization (parallelism between the x, y, z values of a vector, in this case). I get the impression that it's designed for the other type of vectorization (parallelism between multiple vectors).
NOTE: Future benchmarks will be compiled with -march=native
unless otherwise
indicated, because XSimd only supports 4-way batches of doubles when they're supported in hardware.
4.4.5 Switching from Line Segments to Rays
The intersection methods I've been working with take bounds arguments t0
and
t1
, meaning that they are actually testing for intersections with line segments, and
not rays.
In all previous benchmarks, I used values of +/- infinity to generalize to infinite rays.
For future tests I've removed those bounds checks, meaning that the code is now specialized to
infinite rays.
Eliminating the logic came with a corresponding performance advantage.
Implementation | Time |
---|---|
Smits' Method | 4.3852e+07 ms |
Improved | 1.98313e+07 ms |
Clarified | 1.97305e+07 ms |
Branchless | 2.1348e+07 ms |
XSimd | 2.59743e+07 ms |
4.4.6 Effects of inline
Marking a function inline can increase binary sizes, but it has the
benefit of putting each invocation of a function in context.
This is especially important for autovectorization, because it gives the
compiler the ability to recognize natural parallelism on a broader
scale.
Before, vectorization could only happen within the function, exploiting
the paralellism of the x, y, and z coordinates.
Putting the inline
tag on the functions being tested lets the compiler make
optimizations that span function calls;
this means it can exploit the parallelism of the outer loop, using SIMD to check a ray against
multiple boxes at once.
I was disappointed by the speed of the branchless implementation in earlier tests, but this is where its advantages shine through. Early exits helped performance before, but -- critically -- they hinder this type of higher-level vectorization. When a function uses different instructions depending on the data it's applied to, it naturally can't be applied to batches of data. The branchless version has no such issue, so more coarse-grained optimizations can be applied.
Implementation | Time |
---|---|
Smits' Method | 4.36403e+07 ms |
Improved | 2.01639e+07 ms |
Clarified | 1.97437e+07 ms |
Branchless | 1.3715e+07 ms |
XSimd | 1.8891e+07 ms |
These results leave the question: is this representative of real world usage?
4.5 June 16 - June 22
4.5.1 Restructuring Benchmarks
In preparation for creating a new data arrangement, I spent some time cleaning up the simd experimentation code. The main changes included:
- More narrowly defined responsibilities for the Vector3 and BBox classes.
- Improved deserialization code for loading data from files (including istream operators for each of the custom types).
- Namespaces for each intersection implementation
- Results are formatted as a WikiText-syntax table, so that I can paste directly into the wiki.
4.5.2 Validating benchmarks for Correctness
My original approach for catching logic errors in the different
implementations was to watch the hit-rate of their intersection tests;
this was done by counting the intersections in each benchmark.
Unfortunately, this doesn't catch "symmetrical" errors, where a logic
mistake will result in the same number of spurious intersections as
missed ones.
To resolve this, I began saving results into an std::vector<bool>
, to be compared
at the end of the benchmark.
I actually caught more than one mistake when I added this, but nothing that required additional
complexity to fix
(the mistakes didn't have performance implications).
4.5.2.1 Performance Implications
Unfortunately, adding this validation logic does come with a cost.
Adding each result to the end of the std::vector
consumes some time,
and ultimately this produces significantly slower benchmarks.
Implementation | Time |
---|---|
Smits' Method | 4.95657e+07 ms |
Improved | 3.43701e+07 ms |
Clarified | 3.43883e+07 ms |
Branchless | 3.36523e+07 ms |
XSimd | 4.16125e+07 ms |
Luckily, because each benchmark adds the same number of results to the array, the effect is "flat". The relative performance of the different methods is left unchanged, so the same conclusions can be reached.
4.5.3 "Struct-of-Arrays" BBox type
A common approach to improve a program's use of SIMD is to arrange the data in a "Struct-of-Arrays" rather than an "Array-of-Structs" format. This means that data can be collected into SIMD registers using aligned load operations, rather than a "gather".
For example, a collection of 8 (x, y, z) vectors stored in the AoS format would look like ...xyzxyzxyzxyzxyzxyzxyzxyz...
in memory;
this makes it easy to load an entire vector at once, but not to load a group of x values as a batch.
On the other hand, the same data in SoA format would look more like xxxxxxxx...yyyyyyyy...zzzzzzzz
.
This makes it easier to operate on groups of x values using simd
registers, but it has the downside that the data associated with a
specific vector is scattered.
I am optimistic that an SoA packing strategy could be used to improve performance on this benchmark, but it's notably less straightforward to apply to CGAL than the AoS strategy already being used. Arranging data in this way is an optimization that punctures the veil of many high-level abstractions.
The use-case I've been exploring for the intersection function compares each ray to a large number of bounding boxes. Because of this, we should be able to apply an SoA philosophy to the boxes only, while keeping the rays AoS.
4.5.3.1 VBBox
My "Vector-Bounding Box" type extends the BBox<T>
type, where T
is an
std::vector
.
The VBBox
has a constructor which takes a vector of
scalar bounding boxes and packs their underlying data into vectors.
The files containing test data are naturally in Array-of-Structs form,
so loading them directly into the new data structures would be awkward.
Instead, I created a function which produces SoA data from already-loaded
AoS data.
It also adds an access method getr(std::size_t i)
which retrieves a "cross-sectional
view" of the data structure.
This is another bounding box, where T
is a const T&
.
Because references can't be stored directly in the STL containers that hold T
inside my classes, I substituted std::reference_wrapper<const T>
for const T&
.
With the help of this accessor, I can perform intersection tests on individual boxes in the VBBox
without any changes to the scalar intersection code.
4.5.3.2 Benchmark Results
Implementation | Time |
---|---|
Explicit SIMD | 4.53096e+07 ms |
Implicit SIMD | 4.74185e+07 ms |
The fact that the explicit version of the algorithm outperformed the implicit one surprised me. Though the difference is small, it seems like by manually adding simd instructions I found some avenues for optimization that the compiler could not.
Sadly, I was unimpressed with the performance of both AoS methods; both perform worse than the SoA version, while using an equivalent branchless algorithm. My suspicion is that the new arrangement in memory hurts cache performance, which becomes relevant for a computationally-light task like intersection tests.
4.5.3.3 Relevance to Real-life use
Could we use this strategy in CGAL? Knowing that this approach underperforms perhaps makes this question less relevant, but it's still worth looking at.
I believe this isn't actually an option: in order to pack our bounding boxes, we would need to know in advance which ones are relevant (the ones we'll be checking for intersections). This is easy for some applications (particularly brute force), but doing it in the course of traversing a tree would be tantamount to seeing the future.
4.5.4 Caching Broadcasted Ray
Because our intersection compares one ray to several boxes,
there are many places in the code where it's necessary to apply a scalar operation to many elements,
for example [1.0, 2.0, 3.5, 2.0] * 4.0 = [4.0, 8.0, 14.0, 8.0]
.
In python/numpy terms, this type of elementwise math is known as a
"broadcasted" operation.
xsimd provides overloaded operators that make this easy to do, and
that's how I created the original explicit intersection function.
The catch, here, is that AVX, SSE, et al don't actually support broadcasted operations! The CPU can only perform operations between simd registers. As a result, any time we want to do a broadcast operation we also need to do a "broadcast load", which puts the same value in each slot of the simd register. This is repeated every time the intersection is invoked -- even if the ray is the same -- and the optimizer doesn't appear to notice the repeated loads.
We can fix this by performing the broadcast in the outer loop, and then using that cached result for our intersections. Whenever we start performing intersections with a new ray, we (manually) broadcast the ray's parameters to xsimd registers. These broadcasted values are used for all the relevant intersections.
4.5.4.1 Benchmark Results
Implementation | Time |
---|---|
Explicit SIMD | 4.35466e+07 ms |
Implicit SIMD | 4.61569e+07 ms |
This change appears to have produced a modest performance improvement vs performing the broadcast each time. The explicit code now outperforms the implicit code by a slightly larger margin.
4.6 June 23 - June 30
4.6.1 "Array of Structs of Arrays" BBox
In an "AoSoA" arrangement, the values are grouped together into batches, each batch the size of the
CPU's SIMD registers.
On a CPU with 4-way SIMD, data would be arranged like ...xxxxyyyyzzzzxxxxyyyyzzzzxxxxyyyyzzzz...
in memory.
This strategy has several advantages:
- Data does not need to be loaded from scattered locations when filling a single SIMD register (unlike the AoS strategy).
- Data does not need to be loaded from faraway locations in memory when loading relevant variables into different registers (unlike the SoA strategy, where each x, y, z set can be very far away in memory).
- This model actually lines up closely with the way some proposed improvements to the AABB-tree would arrange the bounding boxes in memory.
It also comes with its own challenges:
- The number of items may not be aligned with the chosen batch size (e.g. 47 elements won't cleanly fit into an AoSoA data structure with batch size 4). In this case, it's necessary to retain an additional AoS data structure to hold the leftovers.
So what does it look like to implement this for my own BBox type?
Because of my use of templates, the relevant data structure is almost as simple as a std::vector<BBox<xsimd::batch<double,
4>>>
coupled with an std::vector<BBox<double>>
to hold the leftovers.
Most of the challenges in implementation actually came from loading the
data structure during parsing, and I ended up creating an XBBox
type (which extends the BBox<xsimd::batch<double, 4>> type) to
help with conversion from scalar boxes.
Once the data was packed properly, it became simple to use intersection
methods I had previously written to process the structure (using
explicit or implicit SIMD).
4.6.1.1 Benchmark Results
Implementation | Time |
---|---|
Explicit SIMD | 3.61035e+07 ms |
Implicit SIMD | 3.50253e+07 ms |
Benchmarking shows that this strategy performs better than the SoA model, but doesn't quite beat the original AoS approach.
4.7 July 1 - July 8
4.7.1 Performing Benchmarks on Deepsat
Deepsat is a Xeon server that other members of the Inria TITANE team and I share access to. After arriving at the Inria campus, I decided to re-run my benchmarks on the machine as a way to build familiarity with the workflow here. Most of the server's advantages aren't relevant to my test case: the 20 CPU cores don't benefit my single-threaded program, and the 132 GB of RAM aren't necessary for my 900 MB dataset. What's important to me is the consistency the machine can provide. Whereas my laptop is thermally limited, the server can run benchmark after benchmark with no decrease in performance!
I connected to Deepsat over SSH, and recompiled the solutions to take full advantage of the modern Xeon CPU's instruction set. Because my benchmarks contain their own timing and result-reporting logic, producing the following tables was as simple as running each in turn.
4.7.1.1 Results
Implementation | Time |
---|---|
Smits' Method | 6.38517e+07 ms |
Improved | 3.70988e+07 ms |
Clarified | 3.65901e+07 ms |
Branchless | 3.68068e+07 ms |
XSimd | 4.14903e+07 ms |
Implementation | Time |
---|---|
Explicit SIMD | 5.93713e+07 ms |
Implicit SIMD | 4.44437e+07 ms |
Implementation | Time |
---|---|
Explicit SIMD | 3.96977e+07 ms |
Implicit SIMD | 3.93736e+07 ms |
Performance was slightly worse across the board, but this was expected; Deepsat's Xeon CPU excels at embarrassingly parallel workloads processing massive datasets, and this benchmark is neither of those things. Consumer hardware is optimized for single-core speed, and that's an advantage for this type of work.
Despite its differences, this data leads us to the same conclusions as previous results.
4.7.2 Experimenting with Application of OpenMP Directives
One approach for SIMD that we haven't discussed so far is OpenMP. This is because I've been having trouble getting interesting results from it.
Simply put, I've yet to run into a case where adding a #pragma omp ...
tag resulted in
the compiler making any different choices.
Adding it to the benchmark code didn't affect performance, so to confirm my suspicions I created a
simple program
similar to IBM's
example of how to apply OpenMP for SIMD.
// Naive function
float func_naive(float a, float b) {
    return std::max(a, b);
}

// Openmp-vectorized function
#pragma omp declare simd
float func_openmp(float a, float b) {
    return std::max(a, b);
}
These implementations give us the following results:
Naive:  0.677483s
Openmp: 0.554589s
The OpenMP tag does give us some improvement, but the difference is marginal (and I haven't yet confirmed how much of that comes from not interleaving the benchmarks).
// Naive function
inline float func_naive(float a, float b) {
    return std::max(a, b);
}

// Openmp-vectorized function
#pragma omp declare simd
inline float func_openmp(float a, float b) {
    return std::max(a, b);
}
Naive:  0.730367
Openmp: 0.571762
Inlining both functions doesn't result in significant changes to performance; they're small enough that they were likely already effectively inlined by the compiler.
4.7.3 Profiling Mesh_3 remesh_polyhedral_surface_sm
On Andreas' suggestion, I will be using Mesh_3's remesh_polyhedral_surface_sm
as a
representation of typical usage of the AABB-tree in future tests.
I started by using the profiling tool Perf to identify "hot points" in the program. This is important, because it helps us determine where our efforts are best applied if we want to improve performance. Up until now, I've been using ray-bbox intersection as a target for optimization, under the assumption that because that operation is necessary very often during traversal it must be important to performance. My hope is that a significant amount of time is spent on checking for these intersections, because it means that my work up until now has the potential to be more impactful.
Luckily, profiling results seemed to confirm this prediction:
348 of Perf's samples were during the Mesh refinement operation (most of the remaining were during
data-loading).
Of those, 155 were inside the ray-bbox intersection code (CGAL::Intersections::internal::do_intersect_bbox_segment_aux
).
This
tells us that approximately 44.5% of the time performing the remeshing
operation is spent just on checking for intersections between rays and
bounding boxes, while traversing the tree!
4.7.4 Branchless do_intersect_bbox_segment_aux
The ray-bbox intersection reference implementation has been a good way of experimenting with different ideas and learning about SIMD techniques, but continuing to optimize it encounters diminishing returns. It's begun to make sense to experiment with applying similar techniques to real CGAL code. With that in mind, I evaluated making CGAL's own ray-bbox intersection code branchless. This is a good first step because it's closely related to the work I've done so far, and future optimizations like child-skipping can benefit from such a change.
I started by creating a
new branch on my own fork of CGAL
dedicated to the branchless intersection change.
I'm planning to create branches dedicated to many tests like this one,
so it doesn't make sense to add them all to cgal-public-dev
(instead, I'll merge them into my branch there if they produce good
results).
Next, I annotated
the relevant function
to build familiarity.
CGAL's intersection guarantees exactness, but doesn't lean on the
inherent exactness of the kernel.
Instead, it coerces the floating point type to double and does the
intersection math in double precision, but uses a custom class to keep
track of the possible accumulated error.
It uses that class to determine the margin it needs to be certain of its
result, and has a return type of Uncertain<bool>
, so
that it can return an indeterminate result if the values fall within
the margin.
The result is an intersection function significantly more elaborate than
the ones I've been working with, totaling over 350 lines.
The existing function was not easily amenable to a branchless
structure.
I especially had trouble with the way it mixes conditional logic that's
based on template parameters, and logic based on arguments.
For example, in if(bounded_1 && qx > bxmax) return false;
,
bounded_1
is a template parameter, so conditional logic which depends on it can
be evaluated as a constexpr (resulting in branchless assembly).
qx > bxmax
does not have the same advantage, so for instantiations where
bounded_1
is true, this becomes an early exit.
This type of structuring makes perfect sense in context, but doesn't map
cleanly to a branchless alternative.
More challenging, a lot of the logic later in the code depends on the
guarantees provided by early exits.
This means that eliminating branches has to be done holistically, rather
than as a series of local transformations to the function's logic.
For a simpler test, I replaced the entire existing code with a simple version adapted from my previous branchless code.
template <typename FT, typename BFT,
          bool bounded_0, bool bounded_1, bool use_static_filters>
inline typename Do_intersect_bbox_segment_aux_is_greater<FT, bounded_0, use_static_filters>::result_type
do_intersect_bbox_segment_aux(const FT &px, const FT &py, const FT &pz,
                              const FT &qx, const FT &qy, const FT &qz,
                              const BFT &bxmin, const BFT &bymin, const BFT &bzmin,
                              const BFT &bxmax, const BFT &bymax, const BFT &bzmax)
{
  const BFT bx[] = {bxmin, bxmax};
  const BFT by[] = {bymin, bymax};
  const BFT bz[] = {bzmin, bzmax};

  // Each sign is taken from the segment's direction (q - p), and selects
  // which slab bound is crossed first along that axis
  const int sx = qx < px;
  const int sy = qy < py;
  const int sz = qz < pz;

  // Determine bounds for x, y, and z
  FT xmin = (bx[sx] - px) / (qx - px);
  FT xmax = (bx[1 - sx] - px) / (qx - px);
  FT ymin = (by[sy] - py) / (qy - py);
  FT ymax = (by[1 - sy] - py) / (qy - py);
  FT zmin = (bz[sz] - pz) / (qz - pz);
  FT zmax = (bz[1 - sz] - pz) / (qz - pz);

  // Determine the bounds of the overlapping region
  FT min = std::max({xmin, ymin, zmin});
  FT max = std::min({xmax, ymax, zmax});

  // The ray intercepts if this region overlaps with the interval provided
  return (max >= min) &&            // Check if there is any overlap at all
         !(bounded_0 && max < 0) && // Check if the overlap is outside the ray (before the start)
         !(bounded_1 && min > 1);   // Check if the overlap is outside the ray (after the end)
}
I also added a simple parity check against the old method, when it's not uncertain. This gives me some confidence in this code's correctness, though the dataset tested probably doesn't hit every edge case.
This version uses the full precision types provided by the kernel, meaning that it will never return an uncertain value. This comes with a performance penalty (especially for the kernels with the strongest guarantees of exactness), but it massively simplifies the logic, and may improve performance in real tasks. This is because in the old code, uncertain results were assumed to be hits, resulting in looking at more nodes than strictly necessary.
Simple benchmarking was done using the time
command on single runs of the compiled
program.
Using the old solution, I get performance that tends to look like:
time ~/Documents/cgal-public-dev/Mesh_3/examples/Mesh_3/cmake-build-release/remesh_polyhedral_surface_sm

real    0m0.461s
user    0m0.453s
sys     0m0.006s
Once the optimization is applied, the program is much faster:
time ~/Documents/cgal-public-dev/Mesh_3/examples/Mesh_3/cmake-build-release/remesh_polyhedral_surface_sm

real    0m0.181s
user    0m0.172s
sys     0m0.005s
The actual improvement is much larger than this might suggest: my profiler indicates that the number of samples during traversal went from 182 to just 18 (most of the remaining time is spent constructing the tree). It's challenging to determine how much of this improvement comes from the simpler, faster intersection function, and how much comes from the reduction in spurious intersections. In any case, the complexity of the kernel's exact floating point type means that it's unlikely that this change meaningfully increased the use of SIMD.
4.7.5 Meeting with Adrien Cassagne
Adrien Cassagne is a researcher elsewhere at Inria with work that makes extensive use of SIMD. Most relevant is his paper on MIPP, a custom SIMD wrapper library which he wrote as part of another line of research. Pierre contacted Adrien, and was able to schedule a meeting for the 6th of July. In preparation for our meeting, I prepared a set of slides to explain the problem statement.
Adrien's experience turned out to be relevant to our work, and he made some salient points about our approach so far. He said that wrapper libraries seemed like a good level of abstraction for our needs, which is the same conclusion we came to. He was surprised that our "AoS" and "AoSoA" implementations weren't faster (especially given our high ratio of memory-heavy operations to numerically intense ones), and he offered to examine the test code to try and understand why it would underperform.
Adrien agreed that SIMD was a worthwhile endeavor for CGAL, and he expects to have extra free time until September, so he's interested in arranging more meetings to help.
He also made several suggestions about how to best vectorize the AABB-tree, though he noted that he's never vectorized a tree structure like this before. One that interested me was his mention of stream traversal: performing queries with groups of similar rays simultaneously. This is something that Embree does, so it caught my attention that he independently proposed the same optimization. Stream traversal would mesh well with CGAL's existing abstractions (it could be as simple as defining a new query type which bundles several ray queries), but the catch is that it works best when the rays being cast are very similar. In rendering, this is done by casting rays for adjacent pixels, but for CGAL it's difficult to think of a relevant use-case. Do any of CGAL's packages have a tendency to perform queries of similar rays?
4.7.6 Branchless BBox-BBox Intersection
CGAL's BBox-BBox intersection function is significantly simpler than Ray-BBox intersection.
This is because bounding boxes are axis-aligned, so a pair of boxes intersect whenever their x, y,
and z bounds all intersect.
Because all of the comparisons are along axes, there isn't any opportunity for accumulated error;
this means all the math can be done using the double
type, without any additional logic
needed to add margins.
As you'll see in the assembly, making the logic branchless isn't as simple as combining the logic into a single expression.
Approach | Code | Assembly |
---|---|---|
Original | bool do_overlap(const Bbox_3& bb1, const Bbox_3& bb2) { // check for emptiness ?? if (bb1.xmax() < bb2.xmin() || bb2.xmax() < bb1.xmin()) return false; if (bb1.ymax() < bb2.ymin() || bb2.ymax() < bb1.ymin()) return false; if (bb1.zmax() < bb2.zmin() || bb2.zmax() < bb1.zmin()) return false; return true; } | 00000000004012b0 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)>: 4012b0: f2 0f 10 06 movsd (%rsi),%xmm0 4012b4: 31 c0 xor %eax,%eax 4012b6: 66 0f 2f 47 18 comisd 0x18(%rdi),%xmm0 4012bb: /-- 77 3c ja 4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49> 4012bd: | f2 0f 10 07 movsd (%rdi),%xmm0 4012c1: | 66 0f 2f 46 18 comisd 0x18(%rsi),%xmm0 4012c6: +-- 77 31 ja 4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49> 4012c8: | f2 0f 10 46 08 movsd 0x8(%rsi),%xmm0 4012cd: | 66 0f 2f 47 20 comisd 0x20(%rdi),%xmm0 4012d2: +-- 77 25 ja 4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49> 4012d4: | f2 0f 10 47 08 movsd 0x8(%rdi),%xmm0 4012d9: | 66 0f 2f 46 20 comisd 0x20(%rsi),%xmm0 4012de: +-- 77 19 ja 4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49> 4012e0: | f2 0f 10 46 10 movsd 0x10(%rsi),%xmm0 4012e5: | 66 0f 2f 47 28 comisd 0x28(%rdi),%xmm0 4012ea: +-- 77 0d ja 4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49> 4012ec: | f2 0f 10 47 10 movsd 0x10(%rdi),%xmm0 4012f1: | 66 0f 2f 46 28 comisd 0x28(%rsi),%xmm0 4012f6: | 0f 96 c0 setbe %al 4012f9: \-> c3 retq |
It's immediately clear the original code is riddled with early exits. These early exits may or may not be helpful for performance during sequential operation, but they pose a problem when we want to vectorize this code. | ||
Single-expression | bool do_overlap(const Bbox_3& bb1, const Bbox_3& bb2) { // Only return true if there's overlap on every axis return !(bb1.xmax() < bb2.xmin() || bb2.xmax() < bb1.xmin()) && // No overlap on x !(bb1.ymax() < bb2.ymin() || bb2.ymax() < bb1.ymin()) && // No overlap on y !(bb1.zmax() < bb2.zmin() || bb2.zmax() < bb1.zmin()); // No overlap on z } | 00000000004012b0 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)>: 4012b0: f2 0f 10 06 movsd (%rsi),%xmm0 4012b4: 31 c0 xor %eax,%eax 4012b6: 66 0f 2f 47 18 comisd 0x18(%rdi),%xmm0 4012bb: /-- 77 3c ja 4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49> 4012bd: | f2 0f 10 07 movsd (%rdi),%xmm0 4012c1: | 66 0f 2f 46 18 comisd 0x18(%rsi),%xmm0 4012c6: +-- 77 31 ja 4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49> 4012c8: | f2 0f 10 46 08 movsd 0x8(%rsi),%xmm0 4012cd: | 66 0f 2f 47 20 comisd 0x20(%rdi),%xmm0 4012d2: +-- 77 25 ja 4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49> 4012d4: | f2 0f 10 47 08 movsd 0x8(%rdi),%xmm0 4012d9: | 66 0f 2f 46 20 comisd 0x20(%rsi),%xmm0 4012de: +-- 77 19 ja 4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49> 4012e0: | f2 0f 10 46 10 movsd 0x10(%rsi),%xmm0 4012e5: | 66 0f 2f 47 28 comisd 0x28(%rdi),%xmm0 4012ea: +-- 77 0d ja 4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49> 4012ec: | f2 0f 10 47 10 movsd 0x10(%rdi),%xmm0 4012f1: | 66 0f 2f 46 28 comisd 0x28(%rsi),%xmm0 4012f6: | 0f 96 c0 setbe %al 4012f9: \-> c3 retq |
Refactoring into a single expression may seem like it should result in branchless operation -- after all, it's only producing a single boolean and returning it -- but for reasons I'll explain shortly it produces exactly the same assembly as the original version! | ||
"Branchless" | bool do_overlap(const Bbox_3& bb1, const Bbox_3& bb2) { // Determine how much overlap exists in each axis double x_overlap = std::min(bb1.xmax(), bb2.xmax()) - std::max(bb1.xmin(), bb2.xmin()); double y_overlap = std::min(bb1.ymax(), bb2.ymax()) - std::max(bb1.ymin(), bb2.ymin()); double z_overlap = std::min(bb1.zmax(), bb2.zmax()) - std::max(bb1.zmin(), bb2.zmin()); // Only return true if there's overlap on every axis return x_overlap >= 0 && y_overlap >= 0 && z_overlap >= 0; } | 00000000004012b0 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)>: 4012b0: f2 0f 10 46 18 movsd 0x18(%rsi),%xmm0 4012b5: f2 0f 10 0e movsd (%rsi),%xmm1 4012b9: 66 0f ef ed pxor %xmm5,%xmm5 4012bd: f2 0f 5f 0f maxsd (%rdi),%xmm1 4012c1: f2 0f 5d 47 18 minsd 0x18(%rdi),%xmm0 4012c6: f2 0f 10 5e 08 movsd 0x8(%rsi),%xmm3 4012cb: f2 0f 10 56 28 movsd 0x28(%rsi),%xmm2 4012d0: f2 0f 10 66 10 movsd 0x10(%rsi),%xmm4 4012d5: f2 0f 5f 5f 08 maxsd 0x8(%rdi),%xmm3 4012da: f2 0f 5c c1 subsd %xmm1,%xmm0 4012de: f2 0f 5d 57 28 minsd 0x28(%rdi),%xmm2 4012e3: f2 0f 10 4e 20 movsd 0x20(%rsi),%xmm1 4012e8: f2 0f 5f 67 10 maxsd 0x10(%rdi),%xmm4 4012ed: f2 0f 5d 4f 20 minsd 0x20(%rdi),%xmm1 4012f2: 66 0f 2f c5 comisd %xmm5,%xmm0 4012f6: /-- 72 18 jb 401310 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x60> 4012f8: | f2 0f 5c cb subsd %xmm3,%xmm1 4012fc: | 66 0f 2f cd comisd %xmm5,%xmm1 401300: +-- 72 0e jb 401310 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x60> 401302: | f2 0f 5c d4 subsd %xmm4,%xmm2 401306: | 66 0f 2f d5 comisd %xmm5,%xmm2 40130a: | 0f 93 c0 setae %al 40130d: | c3 retq 40130e: | 66 90 xchg %ax,%ax 401310: \-> 31 c0 xor %eax,%eax 401312: c3 retq |
This version replaces the original approach with something that looks at the problem "intuitively". It does reduce the number of early exits, but it still doesn't produce branchless logic. Looking at the code, it's not obvious where the conditional behavior comes from. The catch: C++'s short-circuit evaluation introduces jumps! Short-circuiting is only occasionally useful in a logical expression like this one, and I hadn't appreciated the degree to which it hamstrings the compiler. | ||
Explicit branchless (intermediate values) | bool do_overlap(const Bbox_3& bb1, const Bbox_3& bb2) { // Determine how much overlap exists in each axis double x_overlap = std::min(bb1.xmax(), bb2.xmax()) - std::max(bb1.xmin(), bb2.xmin()); double y_overlap = std::min(bb1.ymax(), bb2.ymax()) - std::max(bb1.ymin(), bb2.ymin()); double z_overlap = std::min(bb1.zmax(), bb2.zmax()) - std::max(bb1.zmin(), bb2.zmin()); // Determine whether each axis has a positive overlap bool x_intersect = x_overlap >= 0; bool y_intersect = y_overlap >= 0; bool z_intersect = z_overlap >= 0; // Only intersects if there's overlap on every axis bool xy_intersect = x_intersect && y_intersect; bool intersect = xy_intersect && z_intersect; return intersect; } | 00000000004012b0 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)>: 4012b0: f2 0f 10 56 18 movsd 0x18(%rsi),%xmm2 4012b5: f2 0f 10 0e movsd (%rsi),%xmm1 4012b9: f2 0f 5d 57 18 minsd 0x18(%rdi),%xmm2 4012be: f2 0f 5f 0f maxsd (%rdi),%xmm1 4012c2: f2 0f 10 5e 08 movsd 0x8(%rsi),%xmm3 4012c7: f2 0f 5f 5f 08 maxsd 0x8(%rdi),%xmm3 4012cc: f2 0f 10 46 28 movsd 0x28(%rsi),%xmm0 4012d1: f2 0f 10 66 10 movsd 0x10(%rsi),%xmm4 4012d6: f2 0f 5c d1 subsd %xmm1,%xmm2 4012da: f2 0f 10 4e 20 movsd 0x20(%rsi),%xmm1 4012df: f2 0f 5d 4f 20 minsd 0x20(%rdi),%xmm1 4012e4: f2 0f 5d 47 28 minsd 0x28(%rdi),%xmm0 4012e9: f2 0f 5f 67 10 maxsd 0x10(%rdi),%xmm4 4012ee: f2 0f 5c cb subsd %xmm3,%xmm1 4012f2: 66 0f ef db pxor %xmm3,%xmm3 4012f6: 66 0f 2f d3 comisd %xmm3,%xmm2 4012fa: f2 0f 5c c4 subsd %xmm4,%xmm0 4012fe: 0f 93 c0 setae %al 401301: 66 0f 2f cb comisd %xmm3,%xmm1 401305: 0f 93 c2 setae %dl 401308: 21 d0 and %edx,%eax 40130a: 66 0f 2f c3 comisd %xmm3,%xmm0 40130e: 0f 93 c2 setae %dl 401311: 21 d0 and %edx,%eax 401313: c3 retq |
Here, I use additional variables to ensure that intermediate values are actually computed. This is the first version to produce true straight-line assembly! This tends to be the best practice for this kind of problem, but in this case the resulting code is needlessly unclear: even if the assembly doesn't treat them this way, it's preferable for the source to treat x, y, and z symmetrically. | ||
Explicit branchless (boolean operations) | bool do_overlap(const Bbox_3& bb1, const Bbox_3& bb2) { // Determine how much overlap exists in each axis double x_overlap = std::min(bb1.xmax(), bb2.xmax()) - std::max(bb1.xmin(), bb2.xmin()); double y_overlap = std::min(bb1.ymax(), bb2.ymax()) - std::max(bb1.ymin(), bb2.ymin()); double z_overlap = std::min(bb1.zmax(), bb2.zmax()) - std::max(bb1.zmin(), bb2.zmin()); // Determine whether each axis has a positive overlap bool x_intersect = x_overlap >= 0; bool y_intersect = y_overlap >= 0; bool z_intersect = z_overlap >= 0; // Only intersects if there's overlap on every axis bool intersect = x_intersect & y_intersect & z_intersect; return intersect; } | 00000000004012b0 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)>: 4012b0: f2 0f 10 56 18 movsd 0x18(%rsi),%xmm2 4012b5: f2 0f 10 0e movsd (%rsi),%xmm1 4012b9: f2 0f 5d 57 18 minsd 0x18(%rdi),%xmm2 4012be: f2 0f 5f 0f maxsd (%rdi),%xmm1 4012c2: f2 0f 10 5e 08 movsd 0x8(%rsi),%xmm3 4012c7: f2 0f 5f 5f 08 maxsd 0x8(%rdi),%xmm3 4012cc: f2 0f 10 46 28 movsd 0x28(%rsi),%xmm0 4012d1: f2 0f 10 66 10 movsd 0x10(%rsi),%xmm4 4012d6: f2 0f 5c d1 subsd %xmm1,%xmm2 4012da: f2 0f 10 4e 20 movsd 0x20(%rsi),%xmm1 4012df: f2 0f 5d 4f 20 minsd 0x20(%rdi),%xmm1 4012e4: f2 0f 5d 47 28 minsd 0x28(%rdi),%xmm0 4012e9: f2 0f 5f 67 10 maxsd 0x10(%rdi),%xmm4 4012ee: f2 0f 5c cb subsd %xmm3,%xmm1 4012f2: 66 0f ef db pxor %xmm3,%xmm3 4012f6: 66 0f 2f d3 comisd %xmm3,%xmm2 4012fa: f2 0f 5c c4 subsd %xmm4,%xmm0 4012fe: 0f 93 c0 setae %al 401301: 66 0f 2f cb comisd %xmm3,%xmm1 401305: 0f 93 c2 setae %dl 401308: 21 d0 and %edx,%eax 40130a: 66 0f 2f c3 comisd %xmm3,%xmm0 40130e: 0f 93 c2 setae %dl 401311: 21 d0 and %edx,%eax 401313: c3 retq |
Using bitwise operators produces identical assembly, but lets us preserve the code's structure. Both this solution and the previous one suffer from the same readability issue: someone unfamiliar with the reason behind the "ugly" code is liable to clean it up, reintroducing the short circuits. That risk can be mitigated by surrounding the code with explanatory comments. With that in mind, I prefer this solution because it's more easily readable at a glance. | ||
Legible branchless | bool do_overlap(const Bbox_3& bb1, const Bbox_3& bb2) { // Only return true if there's overlap on every axis return !(bb1.xmax() < bb2.xmin() | bb2.xmax() < bb1.xmin()) & // No overlap on x !(bb1.ymax() < bb2.ymin() | bb2.ymax() < bb1.ymin()) & // No overlap on y !(bb1.zmax() < bb2.zmin() | bb2.zmax() < bb1.zmin()); // No overlap on z } | 00000000004012b0 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)>: 4012b0: f2 0f 10 06 movsd (%rsi),%xmm0 4012b4: 66 0f 2f 47 18 comisd 0x18(%rdi),%xmm0 4012b9: f2 0f 10 07 movsd (%rdi),%xmm0 4012bd: 0f 97 c0 seta %al 4012c0: 66 0f 2f 46 18 comisd 0x18(%rsi),%xmm0 4012c5: f2 0f 10 46 08 movsd 0x8(%rsi),%xmm0 4012ca: 0f 97 c2 seta %dl 4012cd: 09 d0 or %edx,%eax 4012cf: 66 0f 2f 47 20 comisd 0x20(%rdi),%xmm0 4012d4: f2 0f 10 47 08 movsd 0x8(%rdi),%xmm0 4012d9: 0f 97 c2 seta %dl 4012dc: 09 d0 or %edx,%eax 4012de: 66 0f 2f 46 20 comisd 0x20(%rsi),%xmm0 4012e3: f2 0f 10 46 10 movsd 0x10(%rsi),%xmm0 4012e8: 0f 97 c2 seta %dl 4012eb: 09 d0 or %edx,%eax 4012ed: 66 0f 2f 47 28 comisd 0x28(%rdi),%xmm0 4012f2: f2 0f 10 47 10 movsd 0x10(%rdi),%xmm0 4012f7: 0f 97 c2 seta %dl 4012fa: 09 d0 or %edx,%eax 4012fc: 66 0f 2f 46 28 comisd 0x28(%rsi),%xmm0 401301: 0f 97 c2 seta %dl 401304: 09 d0 or %edx,%eax 401306: 83 f0 01 xor $0x1,%eax 401309: c3 retq |
Now that we know a solution to the root issue, we can apply it to our first approach. This code contains none of the jumps that approach had, despite appearing almost identical! |
This process illustrates the unique challenges that come with making complex logic branchless: problems are often well-hidden, and solutions unintuitive. This is related to one of the advantages that a library like xsimd presents: its explicitness means that we can have greater confidence in the structure of the produced assembly, without needing to examine it directly.
4.8 July 9 - July 16
4.8.1 Truly Branchless do_intersect_bbox_segment_aux
Earlier I said that the FT type used by the new ray-bbox intersection function was likely too elaborate to be effectively vectorized. Further investigation revealed that in the program I'm using as a benchmark, FT is defined as double! This can't be assumed to be true for most Kernels, but in this particular case it provides an opportunity to look more closely at how the function gets optimized.
Approach | Code | Assembly |
---|---|---|
With Short-circuits | template< typename FT, typename BFT, bool bounded_0, bool bounded_1, bool use_static_filters > typename Do_intersect_bbox_segment_aux_is_greater< FT, bounded_0, use_static_filters >::result_type do_intersect_bbox_segment_aux( const FT &px, const FT &py, const FT &pz, const FT &qx, const FT &qy, const FT &qz, const BFT &bxmin, const BFT &bymin, const BFT &bzmin, const BFT &bxmax, const BFT &bymax, const BFT &bzmax) { const BFT bx[] = {bxmin, bxmax}; const BFT by[] = {bymin, bymax}; const BFT bz[] = {bzmin, bzmax}; const int sx = qx < 0; const int sy = qy < 0; const int sz = qz < 0; // Determine bounds x, y, and z FT xmin = (bx[sx] - px) / (qx - px); FT xmax = (bx[1 - sx] - px) / (qx - px); FT ymin = (by[sy] - py) / (qy - py); FT ymax = (by[1 - sy] - py) / (qy - py); FT zmin = (bz[sz] - pz) / (qz - pz); FT zmax = (bz[1 - sz] - pz) / (qz - pz); // Determine the bounds of the overlapping region FT min = std::max({xmin, ymin, zmin}); FT max = std::min({xmax, ymax, zmax}); // The ray intercepts if this region overlaps with the interval provided Uncertain<bool> new_result = (max >= min) && // Check if there is any overlap at all !(bounded_0 && max < 0) && // Check if the overlap is outside the ray (before the start) !(bounded_1 && min > 1); // Check if the overlap is outside the ray (after the end) return new_result; } | 000000000040c480 <CCGAL::Intersections::internal::do_intersect_bbox_segment_aux<...>(...)>: 40c480: f2 0f 11 7c 24 e0 movsd %xmm7,-0x20(%rsp) 40c486: 66 0f ef ff pxor %xmm7,%xmm7 40c48a: 31 c0 xor %eax,%eax 40c48c: 66 44 0f 28 c0 movapd %xmm0,%xmm8 40c491: 66 0f 2f fb comisd %xmm3,%xmm7 40c495: f2 0f 10 44 24 10 movsd 0x10(%rsp),%xmm0 40c49b: f2 0f 11 74 24 d0 movsd %xmm6,-0x30(%rsp) 40c4a1: 66 0f 28 f3 movapd %xmm3,%xmm6 40c4a5: f2 41 0f 5c f0 subsd %xmm8,%xmm6 40c4aa: f2 0f 11 44 24 d8 movsd %xmm0,-0x28(%rsp) 40c4b0: f2 0f 10 44 24 18 movsd 0x18(%rsp),%xmm0 40c4b6: 0f 97 c0 seta %al 40c4b9: 31 f6 xor %esi,%esi 40c4bb: 66 0f 2f fc 
comisd %xmm4,%xmm7 40c4bf: f2 0f 11 44 24 e8 movsd %xmm0,-0x18(%rsp) 40c4c5: 48 89 c7 mov %rax,%rdi 40c4c8: f2 0f 5c e1 subsd %xmm1,%xmm4 40c4cc: f2 0f 10 44 24 08 movsd 0x8(%rsp),%xmm0 40c4d2: f2 0f 10 5c c4 d0 movsd -0x30(%rsp,%rax,8),%xmm3 40c4d8: b8 01 00 00 00 mov $0x1,%eax 40c4dd: 89 c2 mov %eax,%edx 40c4df: 40 0f 97 c6 seta %sil 40c4e3: 31 c9 xor %ecx,%ecx 40c4e5: f2 0f 11 44 24 f0 movsd %xmm0,-0x10(%rsp) 40c4eb: f2 0f 10 44 24 20 movsd 0x20(%rsp),%xmm0 40c4f1: 66 0f 2f fd comisd %xmm5,%xmm7 40c4f5: f2 41 0f 5c d8 subsd %xmm8,%xmm3 40c4fa: f2 0f 5c ea subsd %xmm2,%xmm5 40c4fe: f2 0f 11 44 24 f8 movsd %xmm0,-0x8(%rsp) 40c504: f2 0f 5e de divsd %xmm6,%xmm3 40c508: 0f 97 c1 seta %cl 40c50b: 29 fa sub %edi,%edx 40c50d: 48 63 d2 movslq %edx,%rdx 40c510: f2 0f 10 44 d4 d0 movsd -0x30(%rsp,%rdx,8),%xmm0 40c516: 48 63 d6 movslq %esi,%rdx 40c519: f2 41 0f 5c c0 subsd %xmm8,%xmm0 40c51e: f2 44 0f 10 44 d4 e0 movsd -0x20(%rsp,%rdx,8),%xmm8 40c525: 89 c2 mov %eax,%edx 40c527: 29 c8 sub %ecx,%eax 40c529: 29 f2 sub %esi,%edx 40c52b: 48 98 cltq 40c52d: 48 63 d2 movslq %edx,%rdx 40c530: f2 44 0f 5c c1 subsd %xmm1,%xmm8 40c535: f2 0f 5e c6 divsd %xmm6,%xmm0 40c539: f2 0f 10 74 d4 e0 movsd -0x20(%rsp,%rdx,8),%xmm6 40c53f: 48 63 d1 movslq %ecx,%rdx 40c542: f2 0f 5c f1 subsd %xmm1,%xmm6 40c546: f2 0f 10 4c c4 f0 movsd -0x10(%rsp,%rax,8),%xmm1 40c54c: f2 0f 5c ca subsd %xmm2,%xmm1 40c550: f2 44 0f 5e c4 divsd %xmm4,%xmm8 40c555: f2 0f 5e f4 divsd %xmm4,%xmm6 40c559: f2 0f 10 64 d4 f0 movsd -0x10(%rsp,%rdx,8),%xmm4 40c55f: 31 d2 xor %edx,%edx 40c561: f2 0f 5c e2 subsd %xmm2,%xmm4 40c565: f2 44 0f 5f c3 maxsd %xmm3,%xmm8 40c56a: f2 0f 5e e5 divsd %xmm5,%xmm4 40c56e: f2 0f 5d f0 minsd %xmm0,%xmm6 40c572: f2 0f 5e cd divsd %xmm5,%xmm1 40c576: f2 41 0f 5f e0 maxsd %xmm8,%xmm4 40c57b: f2 0f 5d ce minsd %xmm6,%xmm1 40c57f: 66 0f 2f cc comisd %xmm4,%xmm1 40c583: /-- 72 07 jb 40c58c <CGAL::Intersections::internal::do_intersect_bbox_segment_aux<...>(...)+0x10c> 40c585: | 66 0f 2f f9 
comisd %xmm1,%xmm7 40c589: | 0f 96 c2 setbe %dl 40c58c: \-> 31 c0 xor %eax,%eax 40c58e: 88 d0 mov %dl,%al 40c590: 88 d4 mov %dl,%ah 40c592: c3 retq |
My previous work produced something that was nearly branchless, but clearly it wasn't quite perfect! The short-circuiting behavior causes the same problems it did in my earlier experiments, and we can fix it with the same solution. | ||
Short-circuits removed | template< typename FT, typename BFT, bool bounded_0, bool bounded_1, bool use_static_filters > typename Do_intersect_bbox_segment_aux_is_greater< FT, bounded_0, use_static_filters >::result_type do_intersect_bbox_segment_aux( const FT &px, const FT &py, const FT &pz, const FT &qx, const FT &qy, const FT &qz, const BFT &bxmin, const BFT &bymin, const BFT &bzmin, const BFT &bxmax, const BFT &bymax, const BFT &bzmax) { const BFT bx[] = {bxmin, bxmax}; const BFT by[] = {bymin, bymax}; const BFT bz[] = {bzmin, bzmax}; const int sx = qx < 0; const int sy = qy < 0; const int sz = qz < 0; // Determine bounds x, y, and z FT xmin = (bx[sx] - px) / (qx - px); FT xmax = (bx[1 - sx] - px) / (qx - px); FT ymin = (by[sy] - py) / (qy - py); FT ymax = (by[1 - sy] - py) / (qy - py); FT zmin = (bz[sz] - pz) / (qz - pz); FT zmax = (bz[1 - sz] - pz) / (qz - pz); // Determine the bounds of the overlapping region FT min = std::max({xmin, ymin, zmin}); FT max = std::min({xmax, ymax, zmax}); // The ray intercepts if this region overlaps with the interval provided Uncertain<bool> new_result = (max >= min) & // Check if there is any overlap at all !(bounded_0 && max < 0) & // Check if the overlap is outside the ray (before the start) !(bounded_1 && min > 1); // Check if the overlap is outside the ray (after the end) return new_result; } | 000000000040c460 <CGAL::Intersections::internal::do_intersect_bbox_segment_aux<...>(...)>: 40c460: f2 0f 11 7c 24 e0 movsd %xmm7,-0x20(%rsp) 40c466: 66 0f ef ff pxor %xmm7,%xmm7 40c46a: 31 c0 xor %eax,%eax 40c46c: 66 44 0f 28 c0 movapd %xmm0,%xmm8 40c471: 66 0f 2f fb comisd %xmm3,%xmm7 40c475: f2 0f 10 44 24 10 movsd 0x10(%rsp),%xmm0 40c47b: f2 0f 11 74 24 d0 movsd %xmm6,-0x30(%rsp) 40c481: 66 0f 28 f3 movapd %xmm3,%xmm6 40c485: f2 41 0f 5c f0 subsd %xmm8,%xmm6 40c48a: f2 0f 11 44 24 d8 movsd %xmm0,-0x28(%rsp) 40c490: f2 0f 10 44 24 18 movsd 0x18(%rsp),%xmm0 40c496: 0f 97 c0 seta %al 40c499: 31 f6 xor %esi,%esi 40c49b: 66 0f 2f fc 
comisd %xmm4,%xmm7 40c49f: f2 0f 11 44 24 e8 movsd %xmm0,-0x18(%rsp) 40c4a5: 48 89 c7 mov %rax,%rdi 40c4a8: f2 0f 5c e1 subsd %xmm1,%xmm4 40c4ac: f2 0f 10 44 24 08 movsd 0x8(%rsp),%xmm0 40c4b2: f2 0f 10 5c c4 d0 movsd -0x30(%rsp,%rax,8),%xmm3 40c4b8: b8 01 00 00 00 mov $0x1,%eax 40c4bd: 89 c2 mov %eax,%edx 40c4bf: 40 0f 97 c6 seta %sil 40c4c3: 31 c9 xor %ecx,%ecx 40c4c5: f2 0f 11 44 24 f0 movsd %xmm0,-0x10(%rsp) 40c4cb: f2 0f 10 44 24 20 movsd 0x20(%rsp),%xmm0 40c4d1: 66 0f 2f fd comisd %xmm5,%xmm7 40c4d5: f2 41 0f 5c d8 subsd %xmm8,%xmm3 40c4da: f2 0f 5c ea subsd %xmm2,%xmm5 40c4de: f2 0f 11 44 24 f8 movsd %xmm0,-0x8(%rsp) 40c4e4: f2 0f 5e de divsd %xmm6,%xmm3 40c4e8: 0f 97 c1 seta %cl 40c4eb: 29 fa sub %edi,%edx 40c4ed: 48 63 d2 movslq %edx,%rdx 40c4f0: f2 0f 10 44 d4 d0 movsd -0x30(%rsp,%rdx,8),%xmm0 40c4f6: 48 63 d6 movslq %esi,%rdx 40c4f9: f2 41 0f 5c c0 subsd %xmm8,%xmm0 40c4fe: f2 44 0f 10 44 d4 e0 movsd -0x20(%rsp,%rdx,8),%xmm8 40c505: 89 c2 mov %eax,%edx 40c507: 29 c8 sub %ecx,%eax 40c509: 29 f2 sub %esi,%edx 40c50b: 48 98 cltq 40c50d: 48 63 d2 movslq %edx,%rdx 40c510: f2 44 0f 5c c1 subsd %xmm1,%xmm8 40c515: f2 0f 5e c6 divsd %xmm6,%xmm0 40c519: f2 0f 10 74 d4 e0 movsd -0x20(%rsp,%rdx,8),%xmm6 40c51f: 48 63 d1 movslq %ecx,%rdx 40c522: f2 0f 5c f1 subsd %xmm1,%xmm6 40c526: f2 0f 10 4c c4 f0 movsd -0x10(%rsp,%rax,8),%xmm1 40c52c: f2 0f 5c ca subsd %xmm2,%xmm1 40c530: f2 0f 5e f4 divsd %xmm4,%xmm6 40c534: f2 44 0f 5e c4 divsd %xmm4,%xmm8 40c539: f2 0f 10 64 d4 f0 movsd -0x10(%rsp,%rdx,8),%xmm4 40c53f: f2 0f 5c e2 subsd %xmm2,%xmm4 40c543: f2 0f 5d f0 minsd %xmm0,%xmm6 40c547: f2 0f 5e cd divsd %xmm5,%xmm1 40c54b: f2 44 0f 5f c3 maxsd %xmm3,%xmm8 40c550: f2 0f 5e e5 divsd %xmm5,%xmm4 40c554: f2 0f 5d ce minsd %xmm6,%xmm1 40c558: 66 0f 2f f9 comisd %xmm1,%xmm7 40c55c: 0f 96 c0 setbe %al 40c55f: f2 41 0f 5f e0 maxsd %xmm8,%xmm4 40c564: 66 0f 2f cc comisd %xmm4,%xmm1 40c568: 0f 93 c2 setae %dl 40c56b: 21 d0 and %edx,%eax 40c56d: 31 d2 xor %edx,%edx 40c56f: 88 c2 
mov %al,%dl 40c571: 88 c6 mov %al,%dh 40c573: 89 d0 mov %edx,%eax 40c575: c3 retq |
By using bitwise operators in place of normal boolean logic, we get true straight-line code. |
I should emphasize that the fact that FT is a double here is only incidental, and can't be relied upon for general use. Neither of these solutions behaves exactly the same as the original version, either: by never returning uncertain results, they're liable to give false certainty. Nevertheless, it's useful to have a branchless version of the intersection function, and it lets us evaluate the advantages of some approaches that can be combined with it.
4.8.2 Benchmarking the Child-Skipping Optimization
Last month, Andreas experimented with adding a child-skipping optimization to the AABB tree's traversal code. By retrieving the grandchildren of each node, we can compare our bounding boxes in groups of 4 at a time. My intention is to examine how the child-skipping optimization combines with our branchless intersection code. Because child-skipping applies the intersection function to groups of bounding boxes at a time, the hope is that when those calls are inlined they can be combined into a vectorized version of the same function.
To this end, I would like to be able to use the child-skipping approach in benchmarks; once that's done, it becomes trivial to integrate it with the branchless intersection code (which is already working in the benchmark on another branch).
Unfortunately, because I'm using remesh_polyhedral_surface_sm as my benchmark, the child-skipping approach isn't immediately compatible with existing code. I had hoped the conversion process would be as simple as enabling the 4-way traversal with a #define, as in Andreas' example code.
The first issue I encountered was that there are actually a large number of intersection traits types provided by the AABB-tree, and I had to add the 4-way intersection function to each of them. That helped, but I then discovered that the program also uses a traits class defined in the Polygon_mesh_processing package: Ray_3_Triangle_3_traversal_traits. Some traits classes have slightly different behavior, which meant that I couldn't simply factor the 4-way intersection out. Ray_3_Triangle_3_traversal_traits has some unique behavior of its own, but it also has a template specialization that uses a much more elaborate intersection function! This intersection function (which conditionally invokes the branchless one) is the source of a lot of trouble: it's challenging to produce a straight-line version of it, and it may also prevent the branchless code it invokes from being inlined the way I'd like.
These complications make child-skipping a much less appealing optimization. I originally thought it could be done in a low-impact way, only affecting certain traversal functions, but instead it requires invasive changes to the way traversal is done. We have no guarantee that this change would actually be a net improvement to performance, let alone a quantitative prediction of its value. As a result, I'm considering shelving this change and dedicating time to examining more straightforward solutions instead.
4.8.3 Preparing to implement N-Way Splitting
4.8.3.1 Planning
This list of steps is by no means complete; it primarily focuses on the member variables the Node type is composed of. Each of the changes has implications for the rest of the code; for example, putting my pointers in a small array will require changes to the relevant accessors. The idea is that changes to how the data is structured will drive the rest of the work. I expect each change to cause minor issues throughout the code-base, which I can fix before moving on to the next step. By breaking things up in this way, I keep the code close to a working state at every point during development. I expect the largest challenges to come from other packages in CGAL that depend on particular implementation details of the existing tree; with luck I'll be able to change them to better respect the AABB-tree's abstractions.
Steps:
- Convert children from a pair of independent pointers to an array of pointers with size 2.
  `void *left, *right;` → `std::array<void *, 2> children;`
- Use `std::variant` to differentiate between child pointers and primitive pointers (rather than using a `void *`).
  `std::array<void *, 2> children;` → `std::array<std::variant<Node *, Primitive *>, 2> children;`
- Replace the array of pointers with a pointer to an array, to pack the relevant data closer together.
  `std::array<std::variant<Node *, Primitive *>, 2> children;` → `std::array<std::variant<Node, Primitive *>, 2> *children;`
- Replace uses of left and right children with operations applied to the list.
  `func(n.left()); func(n.right());` → `for (const auto &c : n.children()) func(c);`
- Parametrize the size of the children list using a hard-coded value or a `#define`.
  `std::array<std::variant<Node *, Primitive *>, 2> children;` → `std::array<std::variant<Node *, Primitive *>, N> children;`
- Replace the hard-coded child count with a template parameter for the node, having a default value of 2.
- Add the child count template parameter to the tree's Traits class, using that value when instantiating the node class (again, with a default value of 2).
- Change the default value to 4 or 8.
Additionally, once these changes are made it becomes simpler to pack the bounding boxes efficiently:
- Move BBox into parent node.
- Break BBoxes into efficiently packed parallel arrays.
4.8.3.2 Child Array
The first planned change was to replace the independent children with an std::array containing them. This change went very quickly, and because the Node class is nicely encapsulated, the difference could be accounted for simply by altering the accessors. I'm using the aabb_any_all_benchmark.cpp test to verify the program, and it actually showed a marginal but consistent improvement in performance! This improvement is around 1% for all kernels used, and may come from std::array's semantics (for example, perhaps both children can be initialized to nullptr in a more efficient way).
4.8.3.3 std::variant Children
Because CGAL doesn't yet depend on C++17, I substituted boost::variant for std::variant. This went almost as smoothly as switching to an array, aside from some extra effort to initialize the node with nullptrs of the correct type. variant provides some additional safety guarantees, but that comes with a performance penalty: the same program runs a little over 1% slower than the original.
4.8.3.4 Pointer-to-array replaces Array-of-pointers
Though not strictly necessary, it seemed prudent to replace my array of pointers of nodes with a pointer to an array of nodes. This could theoretically produce simpler syntax, but more importantly it guarantees that nodes with the same parent will be adjacent in memory.
The first complication I ran into here was that the nodes are already stored in an underlying vector: whenever we allocate new nodes, they get emplaced at the back of it. This means nodes hold references to other places in the vector, which would be invalidated if the vector were to reallocate. To avoid this, the vector has enough space reserved beforehand to avoid the need to reallocate even for the largest (degenerate-case) tree. To have pointers to arrays, I actually need pointers to subsets of the underlying node vector. CGAL isn't on C++20 (yet), so I can't use std::span; instead, I'm using a bare pointer to the group's first element as a stopgap solution.
The next issue was that I needed to change the contents of the underlying vector. Because nodes were allowed to contain a combination of primitives and other nodes, the vector needed to be able to do the same. As a temporary solution, this is done with the help of boost::variant. In the next step, I'm planning to change the structure of the tree so that each node can contain either a pair of nodes or a primitive. This moves the use of variant into the Node type, reduces pointer chasing, and opens the door to experimenting with bucket sizes in the future.
I expected this to perform worse than previous methods, primarily because the underlying node vector is much more poorly packed. My expectations were borne out, as the new code performs nearly 10% worse.
4.8.3.5 Single-primitive leaves
I was expecting this change to be more challenging, because it's a fundamental difference in how the tree is constructed.
In the original tree structure, a node could point to a pair of child nodes (not shown), or a child node and a primitive (the root node), or a pair of primitives (the first node on the right).
      N
     / \
    P   N
       / \
      P   P
With the new way I'm defining the node type, it can either point to a pair of child nodes (which are contiguous in memory) or a single primitive.
    N
     \
     {N, N}
      |    \
      P    {N, N}
            |  |
            P  P
The implementation actually turned out to be simple (each node contains an std::variant determining whether it has children or a primitive), and the Node class is well encapsulated, so I could hide the change with shim code in the left_data and right_data methods. I'll need to remove explicit references to "left" and "right" when I increase the number of children, so these methods will be temporary.
I expected this to hurt performance, but if it does, the cost is counteracted by the nodes' better arrangement in memory: this version performs significantly better than the last approach, only a couple of percent worse than the original.
4.8.3.6 Simplified Traversal Method
This is where the previous changes start to pay off. A node now contains either a single primitive, or the set of its children (for now, always a pair). This is a reduction in the number of valid states a node can be in, and actually simplifies the traversal function. Not only is this an improvement for readability, it actually results in better performance, too. The traversal benchmarks run between 5 and 15% faster than the original version, depending on the kernel!
4.8.3.7 Simplified Ray-Traversal Method
The ray_intersection() method in AABB_ray_intersection.h behaves a little differently than the tree's built-in traversal function. That function can be used to find out whether there are any intersections with the query, or to collect all intersecting primitives, depending on the provided traversal traits. This method specifically finds the first intersected primitive (the one closest to the source); naturally, this is a concept specific to the ray type. The most clearly relevant use here is rendering, where only the closest intersection is needed.
To achieve this functionality efficiently, a very different approach is used. A priority queue is used to look at the nearest nodes first, and the closest intersection is updated whenever a closer one is found. The traversal doesn't stop until the next nearest node is further than current closest intersection, because none of its children could possibly be closer.
Similar to the other traversal method, the algorithm is complicated by the number of states a node is allowed to be in. Now that we have a less complicated node type, we can simplify this method significantly. Here, though, the simplifications actually hurt performance by around 10%. My theory is that primitives used to be checked for intersection immediately when they were encountered, but now they're added to the queue, since they have their own nodes with bounding boxes. There are workarounds to prevent this from happening, which I plan to explore once other optimizations have been applied.
4.8.3.8 Eliminating use of "left" and "right"
The left/right data accessors actually fell out of use when I simplified the traversal functions. Because they didn't reflect the structure of the tree anymore, they didn't have any reason to appear in the new version of the functions.
The left/right child accessors were a more complicated situation. These make perfect sense in the context of the tree's current structure, but if I want to have more than two children then it's obvious they can't be used. Instead, wherever an operation was performed on the left and then the right child, I substituted an iterator over the children. This tended to complicate things only minimally; the main problems came from places where the left and right nodes aren't treated uniformly. When distributing primitives between a pair of nodes, the total count was divided in half to find the count for the left node, and the right node received any that were left over. This is a nice solution for a pair of nodes because it doesn't require special consideration for rounding, but it won't work for larger numbers of children. Instead, I created a function that decides how many primitives each child should get. What's convenient is that this function could distribute the primitives however I'd like; it just needs to be consistent, because it will be used for both construction and traversal. Naturally, the best solution is to make the distribution as even as possible.
This change did hurt performance, and I suspect that these newly introduced loops aren't being unrolled. I also suspect that the code for determining how many primitives a node should contain is performing poorly. Overall, benchmarks run around 50% slower.
4.9 July 17 - July 24
4.9.1 Enabling N-Way Splitting
All of my preparatory work paid off, and getting a tree with N-way splits was accomplished by simply replacing all instances of a hard-coded "2" with a variable N. By changing N, I successfully produced trees of different widths.
Changing N to 4 results in a tree that is traversed marginally slower (a couple of percent); this makes sense, because generally it's now necessary to intersect with more bounding boxes as we descend the tree. This doesn't become an advantage until we begin using SIMD to perform those intersections at minimal additional cost.
I also discovered that the earlier change which massively hurt performance was actually due to omitting the traits.go_further() check that stops the search early. Reintroducing this check greatly improved performance, and the current code is now only around 12% slower than the original.
A clear future step is to define a solid API for the modified tree (for example, N should be a template parameter), but serious time shouldn't be spent on that until after we can show a performance advantage.
4.9.2 Cleaning up Traversal code
The following is my solution for performing a traversal on an N-way tree. The intention of splitting the recursive step into two for loops is to enable SIMD across the different invocations of the non-recursive loop. If we calculated each box intersection just before deciding to recurse, there would be no potential for them to be combined into a batched operation.
template&lt;typename Tr&gt;
void AABB_node&lt;Tr&gt;::traversal(const Query &query,
                              Traversal_traits &traits,
                              const std::size_t nb_primitives) const
{
  if (!traits.go_further()) return;

  // This is a depth-first traversal
  if (nb_primitives == 1) {
    // If there's only one primitive, we've reached a leaf node
    traits.intersection(query, data());
  } else {
    // Determine whether each child intersects
    std::array&lt;bool, N&gt; intersections{};
    for (std::size_t i = 0; i &lt; N; ++i)
      intersections[i] = traits.do_intersect(query, children()[i]);

    // Recursively check each child that intersected
    for (std::size_t i = 0; i &lt; N; ++i) {
      if (intersections[i])
        children()[i].traversal(query, traits, num_primitives(i, nb_primitives));
    }
  }
}
This is where it becomes clear that we're reaching the limits of optimization that can be done simply by analyzing assembly. When this is compiled along with the branchless intersection method from earlier, the result is a tangle of jumps which is a challenge to interpret.
The issue here isn't the number of jumps though: even when the intersection code is branchless, the compiler misses the opportunity to unroll a loop and vectorize across the inlined functions!
4.9.3 Benchmarking Different types of Intersections
Pierre and I have been looking at the idea of putting a bounding box (or several) around the query to speed up traversal. This strategy could make a big difference, because CGAL's bbox-bbox intersection doesn't require any special care to preserve exactness; it's algorithmically simpler, and far more readily compatible with SIMD techniques. The hope is that we can get some performance advantage by not intersecting directly with the query until it's absolutely necessary.
We can get an idea of how big the performance advantage could be by examining the time it takes to perform different types of intersections. A boxed-query approach would replace ray-bbox intersections with simpler but broader bbox-bbox intersections. This means we may intersect with more nodes, but each individual intersection computation will be cheaper. Pierre's intuition was that bbox-bbox intersections should be at least 10x faster than ray-bbox intersections, and perhaps 100x faster than intersections with primitives. To find real-life numbers for these, I created a simple benchmark of the different intersections. This benchmark tested intersections between 1,000 of the query type and 10,000 of the target type, repeated 100 times with interleaved runs as usual. The total test takes a couple of minutes to run, but I divided by the total number of intersections to get the time for just one.
Intersection Type | Time per Intersection (s) |
---|---|
Bbox-Bbox | 2.5e-14 |
Ray-Bbox | 1.34778e-08 |
Ray-Triangle | 8.25178e-08 |
Rather than 10x faster than ray-bbox intersections, bbox-bbox intersections were more than 500,000x faster! The difference likely won't be quite as spectacular in a real context (in the tree the more expensive part is probably chasing pointers), but these results make me pretty optimistic that we can get a performance improvement!
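To make the cost asymmetry concrete, here is a minimal self-contained sketch (not CGAL's actual code) of the two predicates: a box-box overlap test is just six comparisons with no arithmetic, while a ray-box "slab" test needs a multiply plus min/max bookkeeping per axis.

```cpp
#include <algorithm>
#include <array>
#include <limits>

// A stand-in axis-aligned box, stored as min/max corners.
struct Bbox {
  std::array<double, 3> lo, hi;
};

// Box-box overlap: six comparisons, branch-predictable, and easy for the
// compiler to vectorize.
inline bool boxes_overlap(const Bbox &a, const Bbox &b) {
  for (int i = 0; i < 3; ++i)
    if (a.hi[i] < b.lo[i] || b.hi[i] < a.lo[i]) return false;
  return true;
}

// Ray-box slab test: per axis, a subtraction, a multiply by the precomputed
// reciprocal direction, and min/max updates -- much more work per node.
inline bool ray_hits_box(const std::array<double, 3> &org,
                         const std::array<double, 3> &inv_dir,  // 1/direction
                         const Bbox &b) {
  double tmin = 0.0;
  double tmax = std::numeric_limits<double>::infinity();
  for (int i = 0; i < 3; ++i) {
    const double t0 = (b.lo[i] - org[i]) * inv_dir[i];
    const double t1 = (b.hi[i] - org[i]) * inv_dir[i];
    tmin = std::max(tmin, std::min(t0, t1));
    tmax = std::min(tmax, std::max(t0, t1));
  }
  return tmin <= tmax;
}
```

This is only a sketch of the operation counts involved; CGAL's exact predicates do additional work to guarantee correctness, which widens the gap further.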
4.9.4 Implementing a simple Boxed Query technique
My solution for simple boxing of queries has the following features:
- A `Boxed_query<Q>` class, which contains a reference to the query object and a bounding box that encloses it.
- A constructor for the boxed query which takes a query type and an `Iso_cuboid` boundary. This facilitates infinite query types like rays, cutting them at the border so that they can have a finite bounding box. (There's also a simpler constructor for non-infinite types.)
- A custom `CGAL::do_intersect` function which tries to rule out the possibility of an intersection using the bounding box, only falling back to direct intersection with the query if it's necessary.
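The pieces above can be sketched in a few lines. Everything below is illustrative rather than CGAL's actual implementation: the stand-in `Bbox3`/`Segment` types and the trivial `direct_do_intersect` are invented, but the filtering shape of the custom `do_intersect` is the one described.

```cpp
#include <algorithm>

// Stand-in types, just to make the sketch self-contained.
struct Bbox3 { double xmin, ymin, zmin, xmax, ymax, zmax; };
struct Segment { double ax, ay, az, bx, by, bz; };

inline bool overlap(const Bbox3 &a, const Bbox3 &b) {
  return a.xmax >= b.xmin && b.xmax >= a.xmin &&
         a.ymax >= b.ymin && b.ymax >= a.ymin &&
         a.zmax >= b.zmin && b.zmax >= a.zmin;
}

// Trivial stand-in for the expensive direct predicate; in CGAL this would
// be the exact do_intersect overload for the query type.
inline bool direct_do_intersect(const Segment &s, const Bbox3 &b) {
  const Bbox3 sb{std::min(s.ax, s.bx), std::min(s.ay, s.by), std::min(s.az, s.bz),
                 std::max(s.ax, s.bx), std::max(s.ay, s.by), std::max(s.az, s.bz)};
  return overlap(sb, b);
}

// The wrapper: a reference to the query plus a conservative bounding box.
template <typename Query>
struct Boxed_query {
  const Query &query;
  Bbox3 bbox;
};

// The custom do_intersect: the cheap box-box check acts as a negative
// filter, and the direct predicate runs only when the boxes overlap.
template <typename Query>
bool do_intersect(const Boxed_query<Query> &q, const Bbox3 &node_box) {
  return overlap(q.bbox, node_box) && direct_do_intersect(q.query, node_box);
}
```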
While I was developing this, I modified the benchmarking code to save the boolean results of the intersections to an output vector. This let me guarantee that the results were correct, but it also massively hurt performance.
Intersection Type | Time per Intersection (s) |
---|---|
Bbox-Bbox | 3.54384e-09 |
Ray-Bbox | 2.71308e-08 |
Boxed-Ray-Bbox | 2.08792e-08 |
This is promising! The boxed method is nearly 10x slower than plain box-box intersection, which makes sense because its time is likely dominated by the cases where it needs to fall back to ray intersection. Nevertheless, it still beats plain ray-box intersection, while providing equivalent functionality. This test includes the overhead of constructing the boxed query for every ray being intersected, which is actually rather expensive (it uses a boolean operation with an `Iso_cuboid`); a more effective method of cutting the infinite query type could yield a further performance advantage.
Removing the validation step makes the advantage clearer: with this change, the boxed ray beats the conventional method by more than 25%!
Intersection Type | Time per Intersection (s) |
---|---|
Bbox-Bbox | 3.6e-14 |
Ray-Bbox | 2.24505e-08 |
Boxed-Ray-Bbox | 1.80355e-08 |
Notice how bbox-bbox intersection was impacted the most by validation. This is likely in part because it's a small function that can be vectorized well by the compiler; adding the result to a vector after each invocation of the function almost certainly interferes with vectorization between function calls. Even without that factor, its time was dominated by the overhead of saving the result.
4.10 July 25 - August 2
4.10.1 Using Boxed Queries to Traverse the Tree
With the performance potential confirmed, it made sense to try to take advantage of boxed queries when traversing the tree.
4.10.1.1 Specialized Traversal Function
The core of this strategy is a new template specialization for the traversal function, so that we can treat boxed queries specially.
```cpp
template<typename Tr>
template<class Traversal_traits, class Query>
void AABB_node<Tr>::traversal(const Boxed_query<Query> &boxed_query,
                              Traversal_traits &traits,
                              const std::size_t nb_primitives) const;
```
For our first prototype, we wanted to traverse with a single bbox surrounding the query. This bbox is reconstructed for each sub-node, because it only needs to surround the region of the query that falls within that node's own box. If we tried to use the same box for the entire depth of the traversal, it would quickly become much larger than the boxes it's being tested against.
Our specialization of the traversal function builds the new box for each child before it recursively traverses them; it also performs an intersection using the contained query directly once it reaches a primitive.
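The per-child re-boxing can be illustrated in one dimension. The names and shapes here are mine, not the prototype's: before recursing into a child, the query's box is clipped against that child's box, so the filter stays tight at every level of the tree.

```cpp
#include <algorithm>

// A 1-D interval standing in for a bounding box.
struct Box { double lo, hi; };

inline bool overlap(const Box &a, const Box &b) {
  return a.hi >= b.lo && b.hi >= a.lo;
}

// Shrink the query's box to the part that lies inside the child's box.
// In the real traversal this is where the (expensive) conservative
// re-boxing of the query would happen.
inline Box clip(const Box &query_box, const Box &child_box) {
  return {std::max(query_box.lo, child_box.lo),
          std::min(query_box.hi, child_box.hi)};
}
```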
4.10.1.2 Invalidating Traversal Traits Assumptions
When I tried to directly intersect the query and the primitive in the base case of the traversal, I ran into a small snag.
The AABB_traversal_traits classes each provide two functions: `bool do_intersect(const Query& query, const Node& node)`, used to test each child when determining whether to traverse recursively, and `void intersection(const Query& query, const Primitive& primitive)`, used for direct intersections with primitives (the traits class is responsible for collecting the results). The `Query` type is a template parameter of the traits class, which carries the implicit assumption that the same type will be used for both kinds of intersection. Since we want to intersect directly with `Boxed_query`'s underlying query type when we reach a primitive, this assumption no longer holds. As a temporary solution, the two functions are now independently templated, on parameters named `DoIntersectQuery` and `IntersectionQuery` respectively. This shouldn't have any performance penalty, but it does cut against the grain of the `Traits` system's original design; eventually I think it's important to find a cleaner solution.
4.10.2 Issues with Traversal Correctness
Even with careful checking of the algorithm, the boxed query traversal was producing different results than the original "naive" solution. It consistently detected fewer intersections (6,114 vs 6,708, for example). I was unable to eliminate this issue without also taking away any performance advantage of the approach. One solution was to use a pessimized version of the algorithm: rather than assuming that if the box doesn't intersect then its contained query must not, we assume that if the box does intersect, so must its contained query (and if the box doesn't, we try the query directly just in case). This is obviously wasteful, resulting in examining many more nodes than should be necessary, but it does guarantee that all intersections are found, producing the correct result.
Tracking these errors revealed that the missed intersections stemmed from "false negatives" when using the bbox as a filter for intersections between queries and bboxes. Examining the problematic scenarios showed that malformed boxes were not properly surrounding the underlying queries. This points to an issue in the `Boxed_query` constructor.
Pierre and I had a discussion, and our conclusion was that the Intersection system (doing the boolean operation between query and bbox) wasn't producing a conservative result because the kernel was inexact. For near-tangential intersections (where the ray query barely skims the surface of the bbox) the errors produced can become very pronounced. As a temporary solution, I'm testing whether using an exact construction kernel (Epec) will prevent these issues. If I get correct results with the Epec kernel, that means that this problem can be solved by creating a more conservative approach for putting a bbox around the query.
4.10.2.1 Compatibility with Epeck
Getting the boxed query system working with Epeck was more challenging than expected.
Introducing it as another test case in the `any_all_benchmark` caused a wave of template errors, which surprised me because I had expected it to have an interface identical to the other kernels. Fixing the issues helped me discover and eliminate an unnecessary special case left over from the old code, and it ensured that all treatments of the boxed query happen where I expect (before, I had relied on a custom cast operator).
Once Epeck was working, I was able to confirm that it always produced correct results. This means that our issue must stem from the query boxing approach not being conservative, as we thought.
4.10.3 Interval-based Boxing
After some experimentation with template specializations for lines and rays, I found that the simplest approach for adapting the existing boxing system was to use CGAL's kernel converter type. This let me make very few changes to the existing code, and preserve compatibility with the same range of Query types.
I eventually found that using `Interval_nt<false>` instead of `Interval_nt<true>` made a massive performance difference.
```cpp
typedef Simple_cartesian<Interval_nt<false>> Interval_kernel;
typedef typename Kernel_traits<Query>::Kernel Query_kernel;
typedef typename Type_mapper<Query, Query_kernel, Interval_kernel>::type Interval_query;
typedef Cartesian_converter<Query_kernel, Interval_kernel> Converter_K_to_Interval;
...
Converter_K_to_Interval K_to_Interval;
interval_query = K_to_Interval(query);
...
static Bbox_3 bbox_for_enclosed_segment(const Interval_query &query, const Bbox_3 &boundary) {
  Interval_nt<false>::Protector p;
  auto b = Iso_cuboid_3<Interval_kernel>(boundary);
  auto overlap = CGAL::intersection(query, b);

  // If there wasn't any overlap, return an empty bounding box
  if (!overlap) return Bbox_3{};

  // Otherwise, get the bounding box of the result
  return boost::apply_visitor(
      [](auto &&boolean_intersection_result) -> Bbox_3 {
        return boolean_intersection_result.bbox();
      },
      *overlap);
}
```
This allowed for full correctness, but came with a massive performance penalty. Creating a conservative bbox is a very expensive process, and it eliminated the advantage that the boxing strategy had.
4.10.3.1 Performance Results
As usual, benchmarking is done with the `aabb_any_all_benchmark`. This program builds a tree from a triangle mesh that it loads from a file, and then generates a large number of segment, ray, and line queries which it casts into the tree.
Results were disappointing, with a decrease in performance for all kernels.
Kernel | Naive (s) | Boxed (s) |
---|---|---|
Simple cartesian float kernel | 0.571318 | 1.39495 |
Cartesian float kernel | 0.896336 | 1.51499 |
Simple cartesian double kernel | 0.497932 | 1.32457 |
Cartesian double kernel | 0.834149 | 1.45415 |
Epic kernel | 1.1171 | 1.47604 |
Epec kernel | 1.44885 | 1.74818 |
Clearly it's expensive to create these bounding boxes with full correctness, so we'll need to employ a more elaborate strategy to reduce the time we spend on that problem. I think we have two sides to approach that problem from:
- Reduce how expensive it is to create these bounding boxes (e.g. a simpler function to create conservative boxes using interval arithmetic)
- Reduce how often we need to compute them (e.g. only put a new box around the query when the current box stops being useful for filtering, instead of every time we descend the tree)
Of these options, I believe that the second has more promise. Ultimately, creating a conservative bounding box can only be optimized by a certain amount, especially while retaining our genericity.
For future sections, statements on performance will focus on the Epic kernel results. The Epic kernel is the most likely to see a useful improvement, because it has the most expensive ray-bbox intersections.
4.10.4 Tunable Parameters
During a pair programming session, Andreas and I explored several avenues for improving our performance. We identified a collection of parameters that could be tuned to improve our results.
In the following table, all variations not listed in (parentheses) have been built and benchmarked. Each is discussed in more detail below.
Parameter | Variations |
---|---|
Conditional Shrinking | Never, Always, When all children intersect |
BBox Count | 1, 2, 4, (N) |
Children per Tree-node | 1, 2, 4, 8, (N) |
Trust in the Filter | Fall back when uncertain, Always trust |
Including the naive approach, we have a total of 73 different combinations of optimizations.
We didn't do a grid search of the parameters, but as we implemented the different optimizations we tested a variety of combinations. Some of the parameters have independent effects on performance, but others depend strongly on which other optimizations are active.
4.10.4.1 Conditional Shrinking
This parameter determines when we create a new bounding box for our boxed query. This is important because it affects two opposing aspects of our performance:
- It can increase or decrease how often the `bbox_for_enclosed_segment` function is invoked, an operation we've determined to be very expensive.
- It can decrease or increase how tightly the bbox encloses the underlying query, which affects how many nodes we end up visiting.
We applied three different approaches, but also discussed the possibility of others.
Never Shrink
Here, we build the Boxed_query at the start of the traversal and then never update it again. This has the advantage of rarely requiring building a new bbox, but it means that the bbox will dwarf the node's boxes as we get deeper in the tree. The result was that the enclosing box quickly becomes useless, and overall it detracted from performance.
Always Shrink
This is on the opposite end of the spectrum. For this strategy, we create a new (smaller) bbox each time we descend the tree into a child. This ensures that the box always fits the query as tightly as possible, but it means that `bbox_for_enclosed_segment` is used constantly. It does better than never shrinking, but not particularly well.
Shrink when all Children Intersect
This strategy falls between the other two. It uses a heuristic to determine whether the box needs to be made smaller. We discussed several heuristics, but the only one implemented depends on intersections with child nodes. If the current box intersected with all the child nodes, then it might be too big (it's large enough to touch them all). When this happens, we shrink the box and continue traversing the tree.
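A sketch of that heuristic, assuming an N-way node with a per-child hit mask like the one computed in the traversal code (the function name is mine):

```cpp
#include <array>
#include <cstddef>

// Decide whether the query's bounding box should be rebuilt before
// descending. If the box touched every child of the current node, it is
// probably much larger than the boxes it will be tested against below,
// so shrinking it is likely to pay off.
template <std::size_t N>
bool should_shrink(const std::array<bool, N> &child_hits) {
  for (bool hit : child_hits)
    if (!hit) return false;  // at least one miss: the box is still selective
  return true;               // hit everything: the box is too coarse
}
```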
4.10.4.2 Multi-BBox
By using more than one box to wrap the primitive, we may be able to get an unintuitively large performance advantage. This is because of some nonlinearities in the way that costs and benefits scale as the number of bboxes scale:
- Using two boxes to surround a segment instead of one reduces the total combined volume, since each box can be smaller. Importantly, the new boxes don't just halve the total volume: each box shrinks by half in every dimension, so each is 1/8 the original volume and together they occupy only 1/4 of it.
- Creating two boxes instead of one is less than twice as expensive. Getting the start and end points requires the expensive boolean intersection operation, but to create two boxes we only need one more point in between, which can be found using a relatively inexpensive `midpoint()` operation.
1 Box
If we use only one bbox but shrink it for each child, it's already relatively effective at wrapping the query. Using CGAL's profiling utility, we determined that it eliminated around 25% of direct query-bbox intersections. This represents a sizeable reduction in the number of expensive intersections we need to do.
4 Boxes
Increasing the number of boxes began to run into diminishing returns. For four boxes, we saw a reduction of around 35%. This is good, but we were hoping that the theoretical 16x reduction in query box volume would make a bigger difference.
We're still interested in trying larger numbers of boxes. As the number gets higher, it may make a non-shrinking approach more viable.
4.10.4.3 Children per node
Increasing the number of children per node makes sense when there's an "economy of scale" for query-bbox intersections. We found that this wasn't the case for naive linear queries, and increasing the width of the tree generally decreased performance. It makes sense for Embree to use wider trees because their ray-bbox intersection code can be vectorized, allowing them to do the extra intersections simultaneously. This optimization may also be applicable to our situation, but for different reasons.
2 Children
This is the standard case, and it tended to perform the best for most configurations.
8 Children
When we use singly-boxed queries and always shrink them it tends to be worthwhile to use wider trees. In certain configurations, we even saw improvements in performance up to a 12-way tree. This makes sense, because there's a cost to shrink the bbox which is amortized over the number of times it's used for intersections. Just as important, a wider tree can be much shallower, which reduces the number of times we need to shrink the box.
4.10.4.4 Trust in the Filter
There is more than one way to use our boxed query's bounding box. We have two options, each with its own benefits and drawbacks.
Fall back to direct intersections when uncertain
Our original solution was to use the box as a negative filter. This means that if the two boxes don't intersect, then it's safe to assume that the query and the node's box don't intersect, since boxes are drawn conservatively. If the two boxes do intersect, we would then do (expensive) query-bbox intersection. The code makes use of short-circuiting logic, and looks like this:
```cpp
template<typename Query, typename Other>
inline bool might_intersect(const Boxed_query<Query> &q, const Other &other) {
  return CGAL::do_intersect(q.bbox, other);
}

template<typename Query, typename Other>
inline bool do_intersect(const Boxed_query<Query> &q, const Other &other) {
  return might_intersect(q, other) && do_intersect(q.query, other);
}
```
This has the advantage that the boxed query's `do_intersect` function is always guaranteed to produce the same result as directly intersecting with the query. This means that the traversal will visit exactly the same nodes as the naive solution, which is useful for testing correctness. Not visiting additional nodes also has a theoretical performance advantage, though it's outweighed by the cost of falling back to a direct query-bbox intersection.
Always trust that bbox-bbox intersections are representative
I found it was better to assume that all intersections between boxes correspond to real intersections with the query. This was implemented by making `might_intersect` and `do_intersect` equivalent.
```cpp
template<typename Query, typename Other>
inline bool might_intersect(const Boxed_query<Query> &q, const Other &other) {
  return CGAL::do_intersect(q.bbox, other);
}

template<typename Query, typename Other>
inline bool do_intersect(const Boxed_query<Query> &q, const Other &other) {
  return might_intersect(q, other);
}
```
Using CGAL's profiling tools, we determined that this works because bbox intersections are a very good predictor of real intersections: over 95% of the time, when the boxed query's bbox intersects with the other node, so does the query itself!
When we approach the problem this way we slightly increase the number of query-primitive intersections, but that difference is vanishingly small compared to the number of query-bbox intersections we save.
4.10.4.5 Optimal Results
My first attempt had the following parameters:
- Always shrink (every time we descend the tree)
- 1 box wrapping the query
- 2-way tree
- Fall back to direct query-bbox intersections when uncertain
We tested a wide variety of combinations, and the best we found included the following parameters:
- Heuristic shrinking (only when all children intersect)
- 1 box wrapping the query
- 2-way tree
- Never fall back to direct query-bbox intersections
Choosing the right approaches made a significant difference to performance:
Kernel | Naive (s) | Original Boxed (s) | Optimized Boxed (s) |
---|---|---|---|
Simple cartesian float kernel | 0.571318 | 1.39495 | 0.935523 |
Cartesian float kernel | 0.896336 | 1.51499 | 1.0592 |
Simple cartesian double kernel | 0.497932 | 1.32457 | 0.833513 |
Cartesian double kernel | 0.834149 | 1.45415 | 1.054 |
Epic kernel | 1.1171 | 1.47604 | 1.0115 |
Epec kernel | 1.44885 | 1.74818 | 1.28246 |
Even so, this result isn't without caveats. The optimized box approach only beats the naive approach for particularly heavy kernels. This is because its value is in its reduction of heavy numerical calculations done by the kernel. Bbox-Bbox intersections are very cheap compared to most Epic kernel operations, but this advantage shrinks for something like the Simple cartesian float kernel.
4.10.5 Only Lines Benefit from this Optimization
On Andreas' suggestion, I substituted Ray queries for Lines in the benchmark; this had a rather unfortunate effect on our performance results.
Kernel | Naive (s) | Optimized Boxed (s) |
---|---|---|
Simple cartesian float kernel | 0.613851 | 0.857605 |
Cartesian float kernel | 0.684138 | 0.927448 |
Simple cartesian double kernel | 0.530906 | 0.784656 |
Cartesian double kernel | 0.626222 | 0.921197 |
Epic kernel | 0.811094 | 0.897078 |
Epec kernel | 1.14218 | 1.13796 |
The boxing strategy's performance advantages evaporate, even for the complex kernels that it excelled on!
This is likely happening because the Ray and Segment types have cursory checks to see if their origin is inside the bbox. This is nearly as cheap as bbox-bbox intersection, and tests showed that it's enough to catch intersections a lot of the time (removing this check comes with a sizeable performance penalty). The Line type, on the other hand, is infinite and defined by a line equation. It has no such pre-check, and so it's more expensive on average. Boxing adds an equivalent pre-check for infinite types like the line, and this is where its performance benefit comes from.
This presents a problem, because the AABB tree is generally used for Ray and Segment queries, not Lines. This makes perfect sense, since line queries are slower without a strategy like boxing. Our work here has identified an optimization path that helps only a small subset of use cases.
4.11 August 3 - August 10
4.11.1 Feasibility of a Compact Tree Structure
Andreas and I have been discussing the possibility of accessing child nodes using pointer or index math instead of references. This would rely on the fact that the arrangement of the nodes in memory should be deterministic. It comes with several advantages:
- More compact node type, which wouldn't have to store any pointers or references to its children.
- Easy access to both parents and children of nodes, something that would previously have required storing an additional reference.
- Better cache performance (in part because of the more compact representation, in part due to better locality)
I did some research into this type of method; a relevant concept here is the "succinct data structure", and an even more relevant term for what we're trying to do is the implicit data structure. This field of research has been around perhaps even longer than computers have, and the topic of compact trees has been very thoroughly explored. The best implementation guide I found is this, and it cites a paper from 1964. Wikipedia also has its own short description of the technique. One of the key requirements that the guide lists for this type of structure is that the tree be complete, but the Wikipedia page acknowledges that this isn't actually a strict requirement: if we're okay with having empty nodes (gaps in our underlying vector), we can have any structure we want! Luckily, even with no changes to the implementation, our current construction strategy already produces nearly-complete trees.
Of course, the advantages and simplicity of implementation lead to one question:
4.11.1.1 Why isn't Embree doing this?
Using an implicit structure has some drawbacks, most of which don't apply to CGAL at the moment, but would if we wanted to incorporate some of Embree's techniques.
- Embree has a tree-building technique that's intended to produce an optimal structure, by criteria such as total bbox surface area. This is naturally more computationally expensive than normal construction, but the built tree can provide faster traversal on average. The tree that's built has no guarantees of being anywhere near complete, and if put into an index-based implementation it may result in very large blocks of empty memory.
- Embree has strong support for dynamic trees, which can be efficiently rebuilt as the stored primitives move around. This is useful for Embree's main purpose, raytracing, because it makes it easy to perform efficient traversals of a scene with moving parts. An array-based tree is more complicated to rebuild, and may not support the type of partial-restructuring that embree uses here.
Using an index-based structure doesn't necessarily close the door to optimizations and features like these in the future, but it would make their implementation more challenging if CGAL wanted to incorporate them someday.
4.11.2 Potential API for an Implicit Tree
There exists one major difference that means conversion to a compact representation isn't straightforward: nodes don't know about their children without additional context about the rest of the tree. In the current tree, a node has a reference to its children which can be followed regardless of whether the node is the root or deep inside the tree. For an index-based tree, the offset used to find the children's index depends on the specific index of that node. The root is directly adjacent to its children, but nodes deeper in the tree will be much further from theirs. As a result, to find the location of a child we need to know both the location of the current node and that of the start of the tree.
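For reference, the index arithmetic behind such an implicit N-way tree is tiny. This is the textbook formulation (generalizing the classic binary-heap layout), not CGAL code:

```cpp
#include <cstddef>

// In an implicit N-way tree stored in a flat array, node 0 is the root and
// no child pointers are stored: positions are computed from indices alone.

// Index of the k-th child (0 <= k < N) of `node`.
template <std::size_t N>
constexpr std::size_t child_index(std::size_t node, std::size_t k) {
  return N * node + 1 + k;
}

// Index of the parent of `node` (undefined for the root, node 0).
template <std::size_t N>
constexpr std::size_t parent_index(std::size_t node) {
  return (node - 1) / N;
}
```

Note that both functions need only the node's own index; the "start of the tree" enters when these indices are turned back into addresses, which is exactly the context a bare node lacks.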
My proposal to avoid making traversal much more complex is the addition of an `AABB_node_handle` type, which would encapsulate this information. You could use this type to produce handles for a node's parent and children, and it could provide access to the underlying bbox or primitive. It could be treated just as we treat a node right now, almost as a drop-in replacement.
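A rough sketch of what such a handle might look like, for a binary tree and with invented names. The handle pairs the tree's flat storage with a node index, so the nodes themselves need not store any pointers:

```cpp
#include <cstddef>
#include <vector>

// Placeholder for whatever per-node data the tree stores (bbox or
// primitive reference, in the real design).
struct Node_data { int payload; };

// A lightweight, copyable view into the implicit tree. Parent and child
// handles are computed from index arithmetic rather than stored.
class AABB_node_handle {
  const std::vector<Node_data> *nodes_;  // the tree's flat storage
  std::size_t index_;                    // this node's position in it
public:
  AABB_node_handle(const std::vector<Node_data> &nodes, std::size_t index)
      : nodes_(&nodes), index_(index) {}

  std::size_t index() const { return index_; }
  const Node_data &data() const { return (*nodes_)[index_]; }

  AABB_node_handle parent() const { return {*nodes_, (index_ - 1) / 2}; }
  AABB_node_handle child(std::size_t k) const { return {*nodes_, 2 * index_ + 1 + k}; }
};
```

Because the handle carries the storage pointer itself, traversal code can call `handle.child(i)` exactly where it previously followed a stored reference, which is what makes it nearly a drop-in replacement.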
4.11.3 Preparing to implement Implicit Tree construction
Development will be done on the new implicit-tree-structure branch.
4.11.3.1 Planning
Conversion of the current tree to an implicit structure can be broken into steps, just as conversion to an N-way tree was.
- Add a reference to the tree's root as a member of the Node class; use it to determine the index of the node: `index = this - root`.
- Add a new child accessor which takes the node's current index as an argument: `std::array<Node, N> &children()` → `std::array<Node, N> &children(std::size_t index)`. Use this accessor where the original was used, and confirm their results match for shallow nodes.
- Update the tree's underlying vector to be large enough to hold the tree if it isn't stored compactly (round up the size as though the tree were perfect). Pre-allocate the vector with empty nodes (nodes with empty bboxes).
- Update the tree constructor to use the new index-based accessor when determining where to allocate child nodes. (The two accessors should now produce identical results for all nodes, since the new one determines the references stored for the old system.)
- Use `boost::variant` to allow the node type to hold either a bbox or a reference to a primitive. The old child pointer should now be used exclusively for that purpose.
- Remove the old child pointer (along with the correctness check). The node type should now contain only a bbox and a reference to the root.
- Construct-on-access of child nodes: whenever we retrieve the child of a node, use its constructor to re-build it in place from its bbox and the root before returning it (this won't do anything for now).
- Construct-on-access of the root node: similar to the previous change, but affects the `AABB_tree` type.
- Replace the vector of nodes with a vector of bboxes/primitive references: `std::vector<AABB_node>` → `std::vector<boost::variant<Bbox, Primitive *>>`. The node type should now contain a reference to the root and a reference to the bbox/primitive; update all accessors to follow this reference.
- Rename `Node` to `Node_handle`, because it no longer directly contains the node's data.
- Substitute a `union` where `boost::variant` was used; once the system is confirmed to be working, the variant's type checking adds a lot of unnecessary overhead.
4.11.3.2 "Fat nodes" intermediate solution
The first few steps in my plan produce a system where nodes are arranged as an implicit tree, and use pointer math to find their children. In order to make this work while keeping the logic inside the node, however, it was necessary to give it extra information. When each node is constructed, it is given a reference to the root node (this includes the root node, which has a reference to itself). With this information, each node has the following knowledge:
- Its own bounding box
- The location of the root node, at the start of the vector
- The location of itself (through the `this` pointer)
- A pointer to a primitive (if it has one), or to the child nodes (this is redundant, but lets me check my work)
As a result, the node type is slightly larger than the original implementation. Despite this, it still performs almost exactly as well. This suggests that this benchmark is not particularly cache-limited, which makes sense given what we already know.
4.11.3.3 Making the nodes more compact
The children pointer is now strictly unnecessary, but that space is shared with the pointer to the primitive. We could stop saving a reference to the children, but that would leave us with a primitive pointer that's left null for non-leaf nodes. My solution was to combine the primitive pointer and the bbox using a variant type: `boost::variant<Bbox, Primitive *>`. Unfortunately, this changes the topology of the tree in a way that reduces performance; this will be discussed in a later section.
4.11.3.4 Moving traversal logic to the tree
The node type still contains one redundant value: the root pointer. Nodes need this point of reference to calculate their children's locations, but every node holds the same value!
There are several ways I could go about eliminating this value, but after a discussion with Andreas we decided that the simplest approach would be to move the traversal code to the tree. Since the tree naturally knows where the root node is, we don't need to do anything special to hold onto it once the function is moved. `node.children()` is replaced by `tree.children(node)`; of course, since this function is primarily called from within the tree, it often looks like `children(node)`.
Despite some initial confusion, I eventually determined that this change had no meaningful effect on performance, one way or another.
4.11.4 The source of N-Way Tree's performance advantage
Andreas and I had a conversation with Sebastien about the performance advantages we were seeing in the N-way tree. For traversals with rays and segments this was as much as a 10% advantage, so we wanted to determine the source. Evaluating some other tree implementations showed that the performance increase was likely the result of a particular difference in the tree's topology. The original AABB tree had nodes which could hold either two children, two primitives, or a mix. The N-way tree has nodes that can contain either a set of children or a single primitive. What this means in practice is that each primitive is wrapped by its own bounding box; during traversal this bbox acts as a geometric filter. We previously believed that this filtering was already done by the intersection function, which would have made the bbox redundant, but that turned out not to be the case.
In order to confirm that this was the source of the performance gain, we added a geometric filter directly inside the traversal function of a version of the tree which didn't have this topology.
Tree Implementation | Epick Kernel Time (s) |
---|---|
Original | 0.97 |
N-way (1-node per primitive) | 0.88 |
Implicit Structure (same topology as original) | 0.97 |
Implicit Structure with Filtering | 0.91 |
Performing this filter inside the traversal function means that, at the lowest level of traversal, the boxes are calculated on the spot. This may seem wasteful, but it performs nearly as well as the N-way tree, while only taking up as much space as the original! The tree class actually allows the user to provide a utility for mapping primitives to their bounding boxes. By default this simply invokes `primitive.bbox()`, but it could be substituted for a solution which caches the boxes. This could provide the best of both worlds: it doesn't require building boxes for primitives more than once, and it doesn't take up as much space as all those extra nodes in the tree.
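As a sketch of that caching idea (illustrative only: the names, the `Point_primitive` stand-in, and the map's call interface are my assumptions, not CGAL's actual types), a box-mapping functor could compute each primitive's box on first request and reuse it afterwards:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Minimal stand-ins for a bounding box and a primitive that can
// compute one (illustrative, not CGAL's types).
struct Bbox { double lo[3], hi[3]; };

struct Point_primitive {
  double x, y, z;
  Bbox bbox() const { return Bbox{{x, y, z}, {x, y, z}}; }
};

// A caching box map: each box is built at most once per primitive, so
// we avoid both recomputing boxes during traversal and storing one
// extra tree node per primitive.
class Cached_bbox_map {
  const std::vector<Point_primitive>* m_primitives;
  mutable std::vector<Bbox> m_cache;
  mutable std::vector<bool> m_ready;

public:
  explicit Cached_bbox_map(const std::vector<Point_primitive>& ps)
      : m_primitives(&ps), m_cache(ps.size()), m_ready(ps.size(), false) {}

  const Bbox& operator()(std::size_t i) const {
    if (!m_ready[i]) {  // compute lazily, exactly once
      m_cache[i] = (*m_primitives)[i].bbox();
      m_ready[i] = true;
    }
    return m_cache[i];
  }
};
```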
4.11.5 Building a tree using a sort along the Hilbert curve
One technique used by Embree to great effect is "Spatial Sorting". This is an alternative method of construction that produces a lower quality tree, but does so very quickly. It's useful for Embree when rendering dynamic scenes; when the primitives contained by a tree move, the tree needs to be rebuilt every frame. This may be useful for CGAL as well, wherever trees are built frequently (or only used a few times). Building a high quality tree is expensive, and only worthwhile if that cost is amortized over many traversals.
To build the tree quickly, Embree starts by sorting the primitives along a space filling curve. This can be done very efficiently because it's possible to give each location on the curve a unique code, and these codes can be used in a Radix sort. When primitives are sorted along a space filling curve, nearby objects will be placed closer together in the list. This is true on multiple scales, as well: the first half of the list and the second half of the list should contain groups of primitives with little spatial overlap. This means that after the list has been sorted once, we can produce a tree of acceptable quality by recursively splitting the list up, and putting boxes around its sections.
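To make the "unique code per curve location" idea concrete, here's a sketch of the Morton (Z-order) encoding commonly used for radix-sortable spatial keys. This is illustrative only: Embree's actual builder differs in detail, and CGAL's `hilbert_sort` works differently, sorting along a Hilbert curve without materializing integer codes.

```cpp
#include <cassert>
#include <cstdint>

// Spread the low 10 bits of v so two zero bits separate each original
// bit (the standard "expand bits" step of 3D Morton encoding).
inline uint32_t expand_bits(uint32_t v) {
  v = (v * 0x00010001u) & 0xFF0000FFu;
  v = (v * 0x00000101u) & 0x0F00F00Fu;
  v = (v * 0x00000011u) & 0xC30C30C3u;
  v = (v * 0x00000005u) & 0x49249249u;
  return v;
}

// Interleave three 10-bit quantized coordinates (inputs normalized to
// [0, 1]) into a single 30-bit Morton code. Sorting primitives by this
// integer key with a radix sort orders them along a Z-order curve, so
// spatially nearby primitives end up close together in the list.
inline uint32_t morton3(double x, double y, double z) {
  uint32_t xi = static_cast<uint32_t>(x * 1023.0);
  uint32_t yi = static_cast<uint32_t>(y * 1023.0);
  uint32_t zi = static_cast<uint32_t>(z * 1023.0);
  return (expand_bits(xi) << 2) | (expand_bits(yi) << 1) | expand_bits(zi);
}
```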
If we want to replicate this, we need to do something similar. We can use CGAL's existing `hilbert_sort` function, which arranges primitives along a Hilbert curve.
Luckily, the AABB tree's solution for dividing primitives between nodes
is cleanly separated from the rest of its construction logic.
It is provided as a functor within CGAL's AABB_traits class.
The relevant part of that functor looks like this:
```cpp
template<typename PrimitiveIterator>
void operator()(PrimitiveIterator first,
                PrimitiveIterator beyond,
                const typename AT::Bounding_box &bbox) const {
  PrimitiveIterator middle = first + (beyond - first) / 2;
  switch (Traits::longest_axis(bbox)) {
    case AT::CGAL_AXIS_X: // sort along x
      std::nth_element(first, middle, beyond, boost::bind(Traits::less_x, _1, _2, m_traits));
      break;
    case AT::CGAL_AXIS_Y: // sort along y
      std::nth_element(first, middle, beyond, boost::bind(Traits::less_y, _1, _2, m_traits));
      break;
    case AT::CGAL_AXIS_Z: // sort along z
      std::nth_element(first, middle, beyond, boost::bind(Traits::less_z, _1, _2, m_traits));
      break;
    default:
      CGAL_error();
  }
}
```
This takes the existing list of primitives and partitions it along whichever axis of the bbox surrounding them is longest. Because this is called from within a recursive function, it's also applied to sub-lists. One critical difference of the Hilbert sort approach is that the list only needs to be sorted once; sorting sub-lists would be redundant. To prevent that, we need a boolean flag that ensures the sort isn't performed multiple times. Because this is a functor and not a function, it's easy to add that flag as another member variable. It took a lot of debugging to get the solution to this point, but the end result is shockingly simple:
```cpp
template<typename PrimitiveIterator>
void operator()(PrimitiveIterator first,
                PrimitiveIterator beyond,
                const typename AT::Bounding_box &bbox) const {
  // If this is our first time splitting the primitives, sort them along the hilbert curve
  // This should generally put nearby primitives close together in the list
  if (!has_been_sorted) {
    // Create a property map using our Get_reference_point functor
    auto property_map = boost::make_function_property_map<Primitive, Traits::Point_3, Get_reference_point>(
        Get_reference_point(m_traits));

    // Our search traits will use that property map
    typedef CGAL::Spatial_sort_traits_adapter_3<Geom_traits, decltype(property_map)> Search_traits_3;

    // Perform our hilbert sort using the search traits type with our custom property map
    CGAL::hilbert_sort(first, beyond, Search_traits_3(property_map));

    // In the future, it's not necessary to re-sort the primitives (we can blindly partition them in the middle)
    has_been_sorted = true;
  }
}
```
The most complicated part of this was the `Get_reference_point` object. The underlying Primitive type might not be a point type; in fact, it might be nothing but a reference to a shared object, or it might have its properties generated on the spot by an outside mechanism. To account for this, we need a system that maps each primitive to a representative `Point_3` so it can be used in the spatial sort. That's done using the property map type, which requires a unary function:
```cpp
struct Get_reference_point
    : public std::unary_function<const Primitive &, typename Traits::Point_3> {

  const Traits &m_traits;
  typedef internal::Primitive_helper<Traits> Helper;

  Get_reference_point(const AABB_traits<GeomTraits, AABBPrimitive, BboxMap> &traits)
      : m_traits(traits) {}

  typename Traits::Point_3 operator()(const Primitive &p) const {
    return Helper::get_reference_point(p, m_traits);
  }
};
```
Despite some ugliness, this solution fits very nicely into the AABB tree's existing abstractions.
4.11.5.1 Performance
Based on my understanding of the tradeoffs of a spatial sorting tree construction,
I would expect tree construction to become much faster but traversal to become slower.
The new construction times look like the following (averaged over 100 constructions of a tree from the larger `bunny00.off` dataset).
Kernel | Recursive Splitting | Hilbert Sort |
---|---|---|
Simple cartesian float | 0.0581281 | 0.0644998 |
Cartesian float | 0.0999417 | 0.101805 |
Simple cartesian double | 0.0425205 | 0.0429074 |
Cartesian double | 0.0828917 | 0.0933127 |
Epic | 0.0391275 | 0.0417023 |
These results are a little bit surprising! The Hilbert sort approach takes marginally more time to build a tree, and I was able to confirm that this effect is consistent across multiple runs. My suspicion is that because CGAL's hilbert sort doesn't use a radix sort internally, we lose out on a lot of its advantage.
It should be acknowledged that these times are all very small -- dwarfed by the time spent traversing in this benchmark. The results for traversal are even more surprising:
Kernel | Recursive Splitting | Hilbert Sort |
---|---|---|
Simple cartesian float | 0.713711 | 0.47457 |
Cartesian float | 1.24824 | 0.840549 |
Simple cartesian double | 0.630177 | 0.446037 |
Cartesian double | 1.16526 | 0.81038 |
Epic | 1.2477 | 0.909729 |
Here, the hilbert sort approach handily outperforms the existing solution! What is going on?
4.11.6 Building a Complete tree with repeated subdivision
In several cases, we've encountered scenarios where it would be beneficial if we could guarantee that our tree was complete. Unfortunately, the current solution for constructing the tree doesn't have an obvious mechanism to make this happen. It distributes primitives "shortsightedly", choosing how many go in the left and right subtrees of the current node without any knowledge of the rest of the tree's structure. It's trivial to build a complete tree heuristically, but doing it with only information about the current node is more challenging.
The heuristic it currently uses is to spread the leaves as evenly as possible (for example: 15 primitives --> 8 left, 7 right). Unfortunately, that will often produce non-complete trees. For example, a tree of 6 nodes will look like this:
```
          N
        /   \
      N       N
     / \     / \
    N   P   N   P
    |\      |\
    P P     P P
```
But the complete version which contains the same number of leaf nodes would look like this:
```
          N
        /   \
      N       N
     / \     / \
    N   N   P   P
    |\  |\
    P P P P
```
Notice that in order to produce a complete tree, we need to make the seemingly-unbalanced choice to distribute 4 of our 6 primitives to the left, rather than 3. What we need is a distribution function that can serve as a drop-in replacement for this function, but produce complete trees.
After thinking through the problem with the help of a whiteboard, we assembled the following formula:
Given a number of primitives to distribute, n, our goal is to find the number of primitives that should be distributed to the current node's subtrees in order to produce a complete tree with n leaves (one for each primitive). The solution should give us n_L and n_R, the number of primitives that should be distributed to the left and right subtrees, respectively.

We start from the depth of the tree's deepest full layer:

d = floor(log2(n))

where 2^d is equivalent to the number of nodes at depth d, and d is the depth of the tree's deepest full layer.

Here, r = n - 2^d is the number of remaining nodes if you were to build a tree with only 2^d leaves.

The left subtree should then receive

n_L = 2^(d-1) + min(r, 2^(d-1))

The first term of this equation is the base number of leaves that would be in the left subtree of our tree of size 2^d. The second term finds how many need to be added, with the min providing the guarantee that we won't more than double the number of leaves, as that would require going more than one level deeper.

For a 2-way tree, finding n_R is simple, because the right subtree should contain all the primitives that the left subtree couldn't fit:

n_R = n - n_L

Put together, the complete formula looks like the following:

n_L = 2^(d-1) + min(n - 2^d, 2^(d-1)),   n_R = n - n_L,   where d = floor(log2(n))
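As a sanity check of the split rule described above, here's a small self-contained implementation (my own sketch, not CGAL code) that reproduces the 6-primitive example:

```cpp
#include <cassert>
#include <cstddef>
#include <utility>

// Split n primitives between left and right subtrees so that repeated
// recursive splitting yields a complete binary tree with n leaves.
// d is the depth of the deepest full layer, 2^(d-1) the base number of
// leaves per subtree, and n - 2^d the leftover leaves to distribute.
std::pair<std::size_t, std::size_t> complete_split(std::size_t n) {
  std::size_t d = 0;
  while ((std::size_t{1} << (d + 1)) <= n) ++d;  // d = floor(log2(n))
  std::size_t full = std::size_t{1} << d;        // 2^d
  std::size_t half = full >> 1;                  // 2^(d-1)
  std::size_t extra = n - full;                  // leftover leaves
  // min() caps the extras so the left subtree at most doubles its leaves
  std::size_t left = half + (extra < half ? extra : half);
  return {left, n - left};
}
```

For n = 6 this yields the 4/2 split shown in the diagram above, and for n = 12 it gives 8/4, where a naive even split would give 6/6.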
4.11.6.1 Generalizing to higher-order trees
This math works well for a standard 2-way tree, but completeness is just as important for higher order trees, where each node can have more than 2 children. Luckily, the math was relatively simple to generalize.
We're now finding n_i, the number of primitives in node i, for any i in the set of children {0, ..., k-1} of a k-way node:

n_i = k^(d-1) + min(r_i, (k - 1) * k^(d-1)),   where d = floor(log_k(n))

Here, r_i = n - k^d - sum over j < i of (n_j - k^(d-1)) is the number of remaining primitives left over after extra primitives have already been distributed to nodes 0 through i-1. Besides that, this formula functions in much the same way as the 2-way function.
4.11.7 Better benchmarks for Hilbert-sort construction
Our earlier benchmarks for the Hilbert sort tree construction approach produced some unintuitive results. Conventional wisdom is that building a tree this way should be faster than recursive splitting, but that the tree produced will be of lower quality. In benchmarks, this would show as a large improvement in construction speed, coupled with a small penalty to traversal speed. Our tests showed the exact opposite: construction speed was slightly slower, and traversal was faster!
Development of the hilbert sort traits class is now part of a new pull request. As part of this pull request, it's worthwhile to clarify how and where the new technique outperforms the recursive splitting approach. To this end, I created a suite of benchmarks which show the relevant comparisons. The benchmarking code is heavily templated, so it's simple to measure a large cross-section of potential parameters.
(Trees are constructed from `handle.off`, and queries are randomly generated line segments.)
CGAL::Simple_cartesian<float> | Recursive Partition | Hilbert Sort |
---|---|---|
construction | 0.000724111 s | 0.000778169 s |
traversal | 4.4037e-07 s | 3.9547e-07 s |
CGAL::Cartesian<float> | Recursive Partition | Hilbert Sort |
---|---|---|
construction | 0.00129678 s | 0.00168347 s |
traversal | 5.5959e-07 s | 4.988e-07 s |
CGAL::Simple_cartesian<double> | Recursive Partition | Hilbert Sort |
---|---|---|
construction | 0.000457487 s | 0.000572949 s |
traversal | 4.2589e-07 s | 3.7067e-07 s |
CGAL::Cartesian<double> | Recursive Partition | Hilbert Sort |
---|---|---|
construction | 0.00112452 s | 0.0015394 s |
traversal | 5.3615e-07 s | 4.7219e-07 s |
CGAL::Epick | Recursive Partition | Hilbert Sort |
---|---|---|
construction | 0.000462438 s | 0.000580323 s |
traversal | 5.5827e-07 s | 4.845e-07 s |
This is a lot of data, so in the future we'll focus primarily on the last test (the one that uses the Epick kernel). Epick is the heaviest kernel, and its results tend to be consistent with the others.
CGAL::Epick | Recursive Partition | Hilbert Sort |
---|---|---|
construction | 0.000462438 s | 0.000580323 s |
traversal | 5.5827e-07 s | 4.845e-07 s |
Here, construction is a little over 20% slower. This is significant, and raises questions, considering we actually expected a significant increase in speed. Traversal, on the other hand, is over 15% faster. This is very interesting, since we expected the Hilbert-constructed tree to be slower to traverse. Both trees produce the same results, which suggests that something might be wrong with the partitioned tree that is preventing it from reaching the performance it could.
In any case, if the Hilbert sort tree can consistently perform this well, that makes it useful in its own right. There's always the possibility that this performance is the result of a quirk of our input data, so in order to show that that's not the case, we need to test in a variety of different scenarios. For example, using rays instead of line segments should result in more intersections; how does that affect performance?
CGAL::Epick | Recursive Partition | Hilbert Sort |
---|---|---|
construction | 0.000458197 s | 0.00057073 s |
traversal | 5.8002e-07 s | 5.0278e-07 s |
Naturally, construction is almost identical because the tree's data hasn't been changed. The increase in intersections has hurt traversal performance, but this is a flat effect -- this data doesn't change the conclusions we can make.
How about if we were to build a tree from a larger dataset? Before, we loaded data from `handle.off`, containing 1,165 vertices; now we'll use `bunny00.off`, which contains a massive 37,706 vertices -- over 30x larger! Unlike the previous test, we should expect this to produce different results for construction times.
CGAL::Epick | Recursive Partition | Hilbert Sort |
---|---|---|
construction | 0.0357437 s | 0.0420231 s |
traversal | 8.12154e-07 s | 9.40858e-07 s |
Here, we see no change to the relationship between construction times, but our traversal performance has reversed! Traversing the Hilbert sort tree is now 15% _slower_ on average, rather than faster. My first guess was that because this test runs longer, the Hilbert sort was at a disadvantage from running second (once the CPU is already hot). Swapping the order of the tests hardly affected the results, showing that this wasn't the case. Perhaps for deeper trees, the optimality of their structure becomes more important to performance.
4.11.8 Improving Hilbert sort construction performance
4.11.8.1 Bottom-up boxing
The Hilbert sort tree isn't being constructed as fast as it could be, because the bbox is still being produced on the way down. The Hilbert sort process still involves recursive splitting, so its real advantage comes from the fact that the splitting process doesn't require knowledge of bounding boxes. Because of this, the boxes can be produced more cheaply from the bottom up (each box depends only on its child boxes, and not on the list of primitives it contains).
We can create our boxes from the bottom up with the addition of a template specialization that treats the hilbert traits class differently. Once this is added, we see a big improvement in construction speed, and the benchmark starts to look a lot more like we expected.
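The bottom-up pass can be sketched like this. It's a simplified illustration (not the actual CGAL code) assuming an implicit heap layout where, for n leaves, indices 0 through n-2 are internal nodes and the last n leaf boxes have already been filled from the sorted primitives:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Illustrative bounding box (not CGAL's Bbox_3).
struct Bbox {
  double lo[3], hi[3];
};

// The union of two boxes: each internal box depends only on its
// children, never on the primitive list.
Bbox merge(const Bbox& a, const Bbox& b) {
  Bbox r;
  for (int k = 0; k < 3; ++k) {
    r.lo[k] = std::min(a.lo[k], b.lo[k]);
    r.hi[k] = std::max(a.hi[k], b.hi[k]);
  }
  return r;
}

// boxes has 2*n_leaves - 1 entries; the last n_leaves entries are the
// leaf boxes. Walk the internal nodes from last to first so both
// children are always ready before their parent is computed.
void build_boxes_bottom_up(std::vector<Bbox>& boxes, std::size_t n_leaves) {
  std::size_t n_internal = n_leaves - 1;
  for (std::size_t i = n_internal; i-- > 0;)
    boxes[i] = merge(boxes[2 * i + 1], boxes[2 * i + 2]);
}
```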
CGAL::Epick | Recursive Partition | Hilbert Sort |
---|---|---|
construction | 0.0297283 s | 0.0230078 s |
traversal | 6.85527e-07 s | 8.32478e-07 s |
Construction is now over 20% faster, and traversal is around 20% slower.
4.11.8.2 Multithreading
We can get a further speedup by enabling a multithreaded sort of the primitives, a feature provided by CGAL's hilbert sort implementation.
CGAL::Epick | Recursive Partition | Hilbert Sort |
---|---|---|
construction | 0.0289884 s | 0.00990184 s |
traversal | 6.75906e-07 s | 7.76014e-07 s |
This gives the Hilbert sort an even larger advantage, though it's not entirely fair -- the recursive partition method doesn't incorporate even limited multithreading. The Hilbert traits can now be used to construct a tree nearly 3 times as fast as the typical method, which can make its tradeoffs very appealing.
4.11.9 More challenging benchmarks
The tweaks we made to the new traits class have given it demonstrable advantages over the traditional method, but in order to reason about when it's appropriate we need to see how it performs in extreme use cases. To that end, we created some more difficult benchmarks.
4.11.9.1 Larger dataset
Andreas provided a .PLY file with a high resolution scan of a gargoyle. Its mesh is made up of over 7,000,000 vertices, so building a tree for it is an expensive operation.
CGAL::Epick | Recursive Partition | Hilbert Sort |
---|---|---|
construction | 0.705003 s | 0.441753 s |
traversal | 6.02128e-06 s | 7.83255e-06 s |
The new parallel traits class maintains a lot of its advantage, remaining around 40% faster than the traditional method. The size of the tree also had little effect on the relative cost of traversals, with the lower quality tree requiring around 30% more time to traverse.
4.11.9.2 Real-life use case
The fact that the new construction can be used simply by swapping the traits type opens up a lot of opportunities for interesting real-life tests. This makes it easy to look at how the tradeoffs we've been measuring affect actual use-cases. Surface mesh segmentation is a perfect worst-case benchmark for this optimization, for a few reasons:
- Because the underlying primitives aren't changed, the tree only needs to be constructed a single time.
- That one tree is used for a large number of queries (one for each vertex in the output mesh).
- Each query is guaranteed to fully descend the tree. This is because the queries are rays shot from the inside of a watertight mesh, so they're guaranteed to hit a face of that mesh. As a result, every traversal is very expensive.
We benchmarked segmentation by timing `segmentation_from_sdf_values_SM_example`. The package subclasses `AABB_traits`, so we could swap strategies easily by replacing what it subclassed.
Segmentation | Recursive Partition | Hilbert Sort |
---|---|---|
time | 11.0 s | 12.7 s |
Here, we see that for this use case, the new technique isn't worth its tradeoffs.
4.12 August 11 - August 17
4.12.1 Quantifying tradeoffs
So when is it actually worth it to use the new traits class? We can determine this from the performance data we've already collected; we'll look at the largest test:
CGAL::Epick | Recursive Partition | Hilbert Sort |
---|---|---|
construction | 0.705003 s | 0.441753 s |
traversal | 6.02128e-06 s | 7.83255e-06 s |
The new traits class reduced the time to construct a tree by 0.263 s and increased the time to perform one traversal by 0.00000181 s. Taken together, this tells us that the new traits class loses its advantage only once you've performed 145,201 traversals. If you do more than that, then it becomes worthwhile to spend more time to build a higher quality tree, so your traversals can be faster. I expect this number varies quite a bit depending on the exact nature of the data, but the order of magnitude difference should be indicative.
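The arithmetic behind that breakeven figure is simple. As a sketch using the numbers from the table above (the function name is my own):

```cpp
#include <cassert>

// Breakeven point: the number of traversals after which the Hilbert
// tree's construction saving is eaten up by its per-traversal penalty.
//   (construction saving) / (per-traversal penalty)
double breakeven(double slow_build, double fast_build,
                 double fast_traverse, double slow_traverse) {
  return (slow_build - fast_build) / (slow_traverse - fast_traverse);
}
```

Plugging in the gargoyle measurements gives roughly 145,000 traversals, in line with the figure quoted above.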
Of course, this result is specific to the size of the tree we built! Construction time and traversal time both vary non-linearly with the number of primitives in the tree, so in order to get a better idea of how tree size affects our tradeoffs, we need to perform benchmarks at different scales. I created a new benchmarking approach which progressively shrinks the dataset, and measured the effect that had on construction and traversal times.
# Primitives | Recursive Partition Construction | Hilbert Sort Construction | Recursive Partition Traversal | Hilbert Sort Traversal | ||
---|---|---|---|---|---|---|
1637530 | 0.690989 s | 0.437864 s | 5.82817e-06 s | 7.5607e-06 s | ||
1000000 | 0.481111 s | 0.258386 s | 6.25142e-06 s | 7.0905e-06 s | ||
100000 | 0.0604657 s | 0.0258631 s | 6.44138e-06 s | 6.54842e-06 s | ||
10000 | 0.00545731 s | 0.0029774 s | 5.30458e-06 s | 5.58388e-06 s | ||
1000 | 0.00149441 s | 0.00137508 s | 3.64947e-06 s | 3.67336e-06 s |
From here it's easy to calculate breakeven points for each order of magnitude:
# Primitives | Traversals-per-construction |
---|---|
1637530 | 146,000 |
1000000 | 265,000 |
100000 | 323,000 |
10000 | 8,880 |
1000 | 4,990 |
The relationship isn't monotonic: fast construction appears to be most effective at somewhere around 100,000 primitives. As the tree gets larger, the utility of the new construction strategy declines gradually, and when the tree is smaller it loses its advantage rapidly. For a small 1,000-primitive tree, you only need to do around 5,000 traversals before fast construction is no longer worthwhile!
4.12.2 More diverse test data
To evaluate how the nature of the data affects the new construction strategy's advantages, I standardized the testing procedure and created synthetic datasets with Andreas' help.
4.12.2.1 Gargoyle
I converted the gargoyle dataset to `.off`, and rather than simplifying the mesh I randomly removed a subset of its faces.
This change to how the number of primitives is reduced is actually significant,
because it makes the mesh more sparse in addition to reducing the size of the tree.
This will generally result in decreasing hit rates as the dataset is reduced,
but because the change is being applied uniformly across our tests this should be acceptable.
# Primitives | Recursive Partition Construction | Hilbert Sort Construction | Recursive Partition Traversal | Hilbert Sort Traversal | ||
---|---|---|---|---|---|---|
1637530 | 0.68504 s | 0.434351 s | 6.0408e-06 s | 7.48615e-06 s | ||
1000000 | 0.411266 s | 0.207592 s | 5.13054e-06 s | 5.81011e-06 s | ||
100000 | 0.0292015 s | 0.0108758 s | 1.10307e-06 s | 1.4085e-06 s | ||
10000 | 0.0024164 s | 0.00109091 s | 4.99561e-07 s | 6.05938e-07 s | ||
1000 | 0.00018549 s | 0.000163817 s | 3.6545e-07 s | 2.94521e-07 s |
These results mostly look similar to the previous measurements, but things begin to diverge as the number of primitives gets very low. Because the mesh is much more sparse, fewer of the queries descend the tree as far. This makes the effect of a higher quality tree less visible, and the Hilbert-constructed tree even beats the conventional one when there are only 1,000 elements. Luckily, we still get good data for larger trees, and the results mirror the previous test as expected.
# Primitives | Traversals-per-construction |
---|---|
1637530 | 173,445 |
1000000 | 299,710 |
100000 | 60,000 |
10000 | 12,460 |
1000 | (inf) |
When we use the same formula to find our breakeven points, we get mostly reasonable results. The numbers tend to fall within the same orders of magnitude as in the previous test.
4.12.2.2 Rotated Sphere-grid
The dataset for the next test was synthesized using the Polyhedron demo. It consists of an 8x8x8 cubic grid of sphere meshes, with an affine transform applied so that they aren't perfectly on-axis. The gargoyle mesh formed a "shell" from which the rays were cast outwards; this mesh is very different from that, which may have consequences for our performance.
- The rays are shot from outside the spheres, which means that they aren't guaranteed to hit any faces.
- The spheres are densely distributed around the source, so rays are more likely to hit many faces, rather than just one.
# Primitives | Recursive Partition Construction | Hilbert Sort Construction | Recursive Partition Traversal | Hilbert Sort Traversal | ||
---|---|---|---|---|---|---|
842400 | 0.425959 s | 0.186575 s | 1.04627e-05 s | 1.15029e-05 s | ||
100000 | 0.0369721 s | 0.0111089 s | 3.20296e-06 s | 3.64699e-06 s | ||
10000 | 0.00260251 s | 0.00106912 s | 3.8749e-07 s | 4.28021e-07 s | ||
1000 | 0.000217795 s | 0.000175691 s | 5.56993e-08 s | 5.78713e-08 s |
At a glance, these results seem to show a larger advantage for the Hilbert sort; after all, construction is more than twice as fast and traversal is only slightly slower. This may not have as big an effect on the breakeven point as it seems at first: for tests on this dataset, construction was especially fast and traversal was especially slow for either traits class. Practically speaking, this means that the value of a relatively faster construction is lower, because total times are more likely to be driven by the number of traversals.
# Primitives | Traversals-per-construction |
---|---|
842400 | 230,133 |
100000 | 58,247 |
10000 | 37,833 |
1000 | 19,385 |
It seems as though the two opposing effects mostly cancel out. The breakeven point tends to be within the same orders of magnitude as in previous tests.
4.12.2.3 Mixed Sphere-grid
Next, we produced another dataset by merging the affine-transformed sphere grid with the axis-aligned pre-transformation version. Besides increasing primitive count, this also has the effect of increasing overall density and reducing the likelihood that a ray will hit nothing.
# Primitives | Recursive Partition Construction | Hilbert Sort Construction | Recursive Partition Traversal | Hilbert Sort Traversal | ||
---|---|---|---|---|---|---|
1684800 | 0.924095 s | 0.488123 s | 2.49734e-05 s | 2.78055e-05 s | ||
1000000 | 0.508275 s | 0.217452 s | 2.00373e-05 s | 2.10856e-05 s | ||
100000 | 0.0351812 s | 0.0113362 s | 8.32743e-06 s | 9.66192e-06 s | ||
10000 | 0.00257242 s | 0.00102682 s | 4.33653e-06 s | 4.68058e-06 s | ||
1000 | 0.000201988 s | 0.000168705 s | 1.48951e-06 s | 1.46143e-06 s |
This test exhibits the same issue as the first, where for small dataset sizes the traversal data is potentially unreliable, and the Hilbert constructed tree beats the conventional method.
# Primitives | Traversals-per-construction |
---|---|
1684800 | 153939 |
1000000 | 277423 |
100000 | 17868 |
10000 | 4492 |
1000 | (inf) |
Once again, we see results with the same magnitude as the other tests.
4.12.3 Switching from Median to Middle Hilbert Policy
The Hilbert sorting package provides two different strategies: the one we've been using so far splits based on the median of the points, but we might be able to get better results by switching to the "middle" policy.
4.12.3.1 Gargoyle
# Primitives | Recursive Partition Construction | Hilbert Sort Construction | Recursive Partition Traversal | Hilbert Sort Traversal | ||
---|---|---|---|---|---|---|
1637530 | 0.694047 s | 0.438711 s | 5.88445e-06 s | 7.44739e-06 s | ||
1000000 | 0.406627 s | 0.204171 s | 5.05238e-06 s | 5.77835e-06 s | ||
100000 | 0.0294115 s | 0.01054 s | 1.0871e-06 s | 1.45127e-06 s | ||
10000 | 0.00239611 s | 0.00103612 s | 5.37958e-07 s | 6.09279e-07 s | ||
1000 | 0.0001899 s | 0.0001647 s | 3.39e-07 s | 2.95038e-07 s |
# Primitives | Traversals-per-construction |
---|---|
1637530 | 163,369 |
1000000 | 278,877 |
100000 | 51,821 |
10000 | 19,069 |
1000 | (inf) |
For this data, performance is consistently slightly worse, meaning that the breakeven point is lower.
4.12.3.2 Rotated Sphere-grid
# Primitives | Recursive Partition Construction | Hilbert Sort Construction | Recursive Partition Traversal | Hilbert Sort Traversal | ||
---|---|---|---|---|---|---|
842400 | 0.419961 s | 0.187204 s | 1.03394e-05 s | 1.13945e-05 s | ||
100000 | 0.0360399 s | 0.0110656 s | 3.34002e-06 s | 3.62142e-06 s | ||
10000 | 0.00264938 s | 0.00108409 s | 3.97501e-07 s | 4.16641e-07 s | ||
1000 | 0.000221205 s | 0.000191903 s | 5.85699e-08 s | 6.01912e-08 s |
# Primitives | Traversals-per-construction |
---|---|
842400 | 220,602 |
100000 | 88,750 |
10000 | 81,781 |
1000 | 18,073 |
Here, things are a little bit more interesting. The new strategy is slightly better when the number of primitives is small, but performs worse for the largest set.
4.12.3.3 Mixed Sphere-grid
# Primitives | Recursive Partition Construction | Hilbert Sort Construction | Recursive Partition Traversal | Hilbert Sort Traversal | ||
---|---|---|---|---|---|---|
1684800 | 0.920223 s | 0.486563 s | 2.50497e-05 s | 2.79779e-05 s | ||
1000000 | 0.505539 s | 0.215818 s | 2.01638e-05 s | 2.11468e-05 s | ||
100000 | 0.035145 s | 0.0111322 s | 8.34144e-06 s | 9.48135e-06 s | ||
10000 | 0.00264618 s | 0.0010278 s | 4.33139e-06 s | 4.66359e-06 s | ||
1000 | 0.000208092 s | 0.000167298 s | 1.52239e-06 s | 1.53896e-06 s |
# Primitives | Traversals-per-construction |
---|---|
1684800 | 148,098 |
1000000 | 294,731 |
100000 | 21,066 |
10000 | 4,872 |
1000 | 2,462 |
Here, the difference is less consistent, with the new strategy winning some and losing others. These results aren't reliable enough in my eyes to justify switching to the middle-strategy as a default.
4.12.4 More exploration of Implicit Tree Structures
The implicit tree structure I previously created was based on my earlier n-way-tree work, but in order to have an atomic PR, I need to build the same functionality on top of master. An important luxury I have when working on this is that the AABB tree's node class is fully internal, and it isn't mentioned in the documentation. This means that I have few restrictions on changing its API, and I can make some relatively drastic changes to how it works without breaking compatibility with users' code.
Unlike last time, I decided to base this implementation on the concept of a "Node handle". The node type won't contain any direct information about the state of the node it represents, instead, it acts like an elaborate iterator type. To do this, it needs to contain three things:
- `std::vector<Bounding_box> &m_boxes` -- a reference to a collection of boxes
- `std::vector<Primitive> &m_primitives` -- a reference to a collection of primitives
- `std::size_t m_index` -- the index of the relevant node
The node's public API provides a superset of the functionality of the original node. Here's a summarized version of its public methods:
```cpp
class AABB_node {
public:
  AABB_node(std::vector<Bounding_box> &boxes,
            std::vector<Primitive> &primitives,
            std::size_t index = 0);

  void expand(ConstPrimitiveIterator first, ConstPrimitiveIterator beyond,
              const ComputeBbox &compute_bbox,
              const SplitPrimitives &split_primitives,
              const AABBTraits &);

  void traversal(const Query &query, Traversal_traits &traits) const;

public:
  // Existing functionality
  inline const Bounding_box &bbox() const;
  inline Bounding_box &bbox();
  inline const Primitive &primitive() const;
  inline Primitive &primitive();
  inline Node left_child() const;
  inline Node right_child() const;

  // New functionality
  inline bool is_leaf() const;
};
```
A lot of the functionality of the original node has been renamed, and the way it's implemented has changed completely:
- `bbox()` provides access to the node's associated bbox, using the node's index and its reference to the bbox collection.
- `primitive()` is similar, though some math is necessary to determine the index of the primitive in `m_primitives`.
- `left_child()` and `right_child()` provide access to children by constructing them on the spot, since the node handles don't actually exist in an array.
- `expand()` uses a similar recursive algorithm to the original, and it was actually simplified by the fact that the connections between nodes already exist implicitly.
- `traversal()` was also simplified, because it's now easier to determine whether a node is a leaf.
The additional context that the node handle has also makes it easy to add new functionality which the original tree couldn't provide. For example, `is_leaf()`, as mentioned earlier, was as simple as comparing the node's index to the number of non-leaf nodes.
There's also an opportunity to add additional functionality in the future, like the following:
- `parent()` can find the parent using a simple equation of the index.
- `children()` can be found as a list, because they are always adjacent.
- `siblings()` can be done similarly.
Changes to the tree itself are much smaller, and don't affect its API. Importantly, the tree's node array is replaced with a bbox vector. The tree also holds an instance of the root node (index 0). This is only necessary so that the root accessor can still return a pointer; otherwise, it would have made more sense to construct this node on request, like all other nodes.
4.12.4.1 Use of Assertions
One feature that was central to my development process was the way the node handle class uses assertion statements. I applied assertions liberally; the implicit nature of the tree would have made it hard to catch certain types of errors otherwise. Often, I used assertions to validate assumptions rather than as preconditions. For example, here's one way assertions are used during the construction of a tree:
```cpp
...

// Stop when we've reached a leaf node
if (is_leaf()) {

  // Make sure we're only trying to assign a single primitive to this node
  CGAL_assertion_msg(
    beyond - first == 1,
    "Constructor attempted to assign more than one primitive to a leaf node"
  );

  // Make sure the primitives table is aligned with the bbox table
  CGAL_assertion_msg(
    first - m_primitives.get().begin() == primitive_index(),
    "Primitive misalignment! "
    "The tree's topology should guarantee that primitives are aligned with their boxes."
  );

} else {

...
```
Because of the way our implicit construction works, the recursive method has an empty base case. When we reach a leaf node, it should already have all of its connections, and it should be associated with its primitive based on its index. Instead of eliminating this space entirely, I used it to check that the node was imbued with the right properties. I check that the leaf node is being associated with exactly one primitive, as well as that the leaf node can correctly locate the primitive it's associated with.
4.12.5 "Fully boxed" vs "Leafless" trees
A topic that's already come up several times is whether each primitive should have its own bounding box. Here, I'll refer to the options as "Leafless" and "Fully boxed".
- A Leafless tree does not put boxes around each primitive; instead, the deepest boxes in the tree surround pairs of primitives. I'm calling this "leafless" because it doesn't involve any nodes that hold just a single primitive. This is how CGAL's existing AABB tree works, and it comes with a couple of implications:
- The tree needs fewer nodes to hold the same number of primitives.
- A node can hold either two primitives, two child nodes, or a primitive and a node.
- Traversal and construction are more complicated, to account for the different states a node can have.
- A Fully-boxed tree gives every primitive its own node. This is how the current version of the N-way tree works, as well as the implicit tree, and it has its own implications:
- Nearly twice as many nodes are necessary to hold the same number of primitives.
- A node holds either a primitive or a pair of children, with no in-between.
- Child nodes can be held in contiguous arrays, since they aren't mixed in with primitives.
- Recursive traversal and construction are trivial and intuitive, because nodes can only be in two different states, with leaf nodes as the base case.
- Performance tends to be better by default, because the leaf bbox acts as a geometric filter: every primitive gets a cheap bbox pre-check before the exact intersection test.
The AABB tree supports a bbox-map which caches the result of computing the bbox for a primitive. This might be able to provide the same performance as the fully-boxed tree without needing as much additional space. In order to look at the tradeoffs holistically, it's necessary to create a leafless version of our implicit tree.
4.12.5.1 Leafless Implicit Tree
The implicit tree was relatively easy to convert to a leafless structure, using the following changes:
- The box vector was reduced in size, to include only the boxes necessary for this configuration.
- The constructor was changed to only set the bbox of non-leaf nodes (attempts to set it would otherwise cause out-of-bounds errors).
- The math used to find the primitive for a node was changed to account for the new number of boxes.
- Traversal needed additional logic to prevent attempting to intersect with the boxes of leaf-nodes.
Conceptually, the change to our "table" converts it from this:
B B B B B B B B B B B B B P P P P P P P
to this:
B B B B B B P P P P P P P
The way we use indices in our node handles remains the same; the only difference is that certain indices don't have boxes associated with them.
4.12.6 Benchmarking tree configurations
(Testing here is done with the Epic Kernel and the gargoyle_1637530.off dataset.)
First, we want to see whether taking the same tree topology and making it implicit hurts performance. To do this, we can look at how the leafless versions of both trees compare.
Performance | Master (leafless) | Implicit (leafless) |
---|---|---|
Construction Time | 0.941639 s | 0.97235 s |
Traversal Time | 5.76427e-06 s | 6.60947e-06 s |
Memory Usage | 404.0 MiB | 352.8 MiB |
Here, the conversion to implicit comes with a small but measurable performance penalty for both construction and traversal. The slowdown to traversal is not surprising, as I haven't made any optimizations to retrieving properties of nodes (many redundant array accesses happen during traversal). The construction slowdown is, on the other hand, surprising. A lot of logic was removed from the construction code, so I'm uncertain where this penalty comes from.
Memory usage goes down, but not by very much. The primitives themselves are likely responsible for a much larger portion of the memory used than the tree's structure, since bounding boxes and pointers are relatively small compared to high-precision coordinate types. Moreover, even for the massive dataset used here, memory is far from the limiting factor.
We also want to see whether using a bbox-map can help us get better performance out of a leafless tree. For this we can compare the fully-boxed implicit tree to both leafless implementations with the addition of the bbox map. The N-way tree is also included in this benchmark as an example of a fully-boxed architecture that uses explicit connections between nodes.
Performance | Master (leafless + bbox-map) | Implicit (leafless + bbox-map) | Implicit (fully-boxed) | N-way (fully-boxed) |
---|---|---|---|---|
Construction Time | 0.758121 s | 0.840891 s | 0.960223 s | 0.672876 s |
Traversal Time | 5.78328e-06 s | 6.59443e-06 s | 5.64262e-06 s | 5.40254e-06 s |
Memory Usage | 452.7 MiB | 427.7 MiB | 446.0 MiB | 479.0 MiB |
5 Proposed Optimizations
Approach | Predicted Yield | Intrusiveness | Description |
---|---|---|---|
N-Way Splitting | High | High | A wide BVH like Embree's: each node holds N >= 2 children, where N is the number of SIMD lanes. |
Parents hold child bounds | Medium | Low | Rather than each node containing its own boundary, nodes can hold the bounding boxes of each of their children. This improves the arrangement in memory, potentially allowing easier vectorization. |
Child-skipping | ? | Medium | If a node contains a complete tree to depth N, gather all (great)grandchildren at depth N and perform intersections with them (SIMD allows us to perform intersections on multiple nodes without much additional cost). |
SIMD Intersection Func | Low | Low | Fine grained vectorization of the intersection algorithm, no changes outside the function itself. This is likely already being done by the compiler. |
Data Alignment | Low | Medium | Make sure that the AABB-tree's data is stored in memory-aligned data structures, for faster loading into SIMD registers. |
Stream Traversal | Medium | High | Traverse with queries composed of multiple similar rays, improving cache performance and making use of SIMD for simultaneous intersections. |
Query Boxing | ? | Medium | Place the query in a BBox or set of BBoxes, to avoid the need to use high-precision intersection approaches until intersecting with primitives. |
Subtree Collapsing* | Medium | Medium | Leaf nodes have a "bucket size", can hold more than 2 primitives. |
Pointer-less nodes* | High | High | Use index or pointer math to access children, so that each node doesn't need to store a reference. |
Spatial Sort Construction* | High | Low | Build the tree quickly by sorting the primitives along the Hilbert curve. |
* Certain optimizations may not result in increased use of SIMD instructions.
6 Queue
Upcoming work:
- Add description column to tree comparison table
- Sign INRIA copyright transfer
- Examine Ray-BBox intersection implementation
- Determine how SIMD may be applied.
- Extract real examples of ray-bbox intersections from tetrahedral remeshing examples, for use in benchmarking.
- Create benchmark from real examples
- Add a ray-bbox intersection implementation using xsimd
- Cache broadcasted ray type, in case repeated broadcast scalar operations are hurting performance.
- Move experimental work to a branch on cgal-public-dev
- Consider appropriate SIMD libraries
- Create a batched-bbox memory arrangement to compare with array-of-structs and struct-of-arrays arrangements.
- Assemble slides to explain the current progress and central challenge of the project.
- Convert CGAL's existing intersection function to branchless, to see how that interacts with the child-skipping optimization
- Determine the reason CGAL's (unaltered) Mesh code examples fail on my device
- Profile provided benchmarks
  - Goal is to discover whether most time is spent traversing the tree (bbox intersections), or directly testing for intersections with primitives
- Examine BBox-BBox intersection implementation
- Create a simd-friendly version
- Benchmark BBox-BBox vs Ray-BBox vs Ray-Primitive [triangle] intersection.
- Benchmark Ray-BBox vs Boxed-ray-BBox intersection
- Create an N-way branch on cgal-public-dev
- Document the advantages of 1-node-per-primitive
- Determine a breakeven point for hilbert sort construction
  - Produce plots which demonstrate the breakeven point
- Add concurrency parameter to new traits class
- Move `expand`'s template specialization so it's not inline
- Add doxygen documentation for new traits class
  - API
  - Examples
  - Information about tradeoffs
- Benchmark different types of implicit trees