Inria SIMD Blog

Note: This page is mirrored from my page on CGAL's internal wiki, apologies for any incorrect formatting.

Contents

1 Introduction

CGAL’s AABB-Tree is an acceleration structure which speeds up common tasks such as collision-detection. It is used both directly and throughout other packages in the library, so any performance improvements made to this package will pay dividends elsewhere. The package has already been a major target of performance optimization, but one approach that hasn’t yet been fully explored is the use of SIMD, a type of parallel computation where the processor operates on several data values at a time. The introduction of SIMD optimizations to the AABB tree has the potential to substantially improve performance by better taking advantage of the capabilities of modern computer processors.

2 Project Goals

2.1 Minimum Viable Product

  • Benchmarks which compare the performance of CGAL’s AABB-tree and other implementations with realistic workloads. This includes benchmarks of both intersection and distance query functionality, for a variety of primitives.
  • Documentation of targets for improvement in the current implementation, based on profiling with large-scale benchmarks.
  • Selection of a SIMD library, and incorporation of SIMD techniques into the AABB-tree.
  • Benchmarks which demonstrate an improvement in performance when SIMD is enabled.

2.2 Stretch Goals

  • Performance that approaches that of Embree’s Bounding Volume Hierarchy.
  • Application of similar optimization techniques to other tree packages in CGAL, including the kD-Tree and the Orthtree, the subject of my Google Summer of Code assignment in 2020.

3 Timeline

3.1 Now - June 7

Community Bonding Period

  • Learn about the AABB-tree’s functionality as well as its implementation details.
  • Build benchmarks for existing tree structures
  • Select a library for SIMD

3.2 June 7 - June 21

  • Collect performance data through profiling using Intel’s VTune
  • Assemble a table with areas where Embree has the largest performance advantages

[Benchmark Pull Request Merged]

3.3 June 21 - July 7

  • Document optimizations used by Embree
  • Rank optimizations by expected performance yield, and feasibility with respect to the AABB-tree’s current structure.

3.4 July 7 - July 21

  • Apply high-yield optimizations which don’t conflict with AABB-tree’s existing structure.
  • Apply any optimizations which are self-contained (for example, ones which only require changes to individual functions)
  • Benchmark each change, and use unit tests to confirm that no changes conflict with one another

3.5 July 21 - August 7

  • Apply high-yield optimizations which may require more invasive changes to AABB-tree’s implementation
  • Benchmark new changes, repeat integration test

3.6 August 7 - August 14

  • Apply any remaining optimizations that only require simple changes to the AABB-tree.
  • Extend AABB-tree manual to discuss all new optimizations.

[Optimization Pull Request Merged]

4 Progress

4.1 May 1 - May 17

4.1.1 Architectural Analysis

Ayush Saraswat's work during GSoC 2020 demonstrated a massive performance advantage in favor of Embree (over CGAL's AABB-tree). It remains to be seen how much of this advantage comes from SIMD, and how much is architectural.

Pierre Alliez suggested I look at a particular survey of contemporary Bounding Volume Hierarchy design practices. The survey discusses a number of features and optimizations which trees might include. Characterizing both trees with regard to this survey was a good way of doing a motivated tour of the code-bases.

FeatureCGALEmbreeDescription
Construction StrategyTop-downTop-downIs the tree built root-first (then split), or leaves first (then combined)?
Incremental ConstructionYesNoCan primitives be added to the tree without completely reconstructing it?
Linear BVHNoYes (Morton codes)Can the tree be built faster by first sorting primitives along a space-filling curve?
Topology OptimizationNo?Yes (Binned SAH)Can the tree be refined so that it's optimal by some measure, for example minimal total surface area of bounding boxes?
Subtree CollapsingNoNo?Can leaf nodes contain more than N primitives?
Data Layout ApproachNaive?Preallocated, alignedHow is the tree explicitly arranged in memory to reduce cache misses during common operations?
Hardware AccelerationMultithreadingMultithreading, SIMDWhat hardware features does the tree take advantage of?
Spatial SplitsNoYesCan primitives be broken into multiple parts along bbox boundaries?
Wide BVHNoYes (N-way splits)Can the space be partitioned in more than two ways at each node?
Non-Polygonal ObjectsNoYesDoes the tree provide optimized support for different primitive types?
Compact RepresentationLower-precision bboxNaive?What efforts does the tree make to avoid taking up too much memory?
First-hit TraversalYesYesCan you use the tree to find the first primitive a ray intersects with?
Any-hit TraversalYesYesCan you use the tree to find out whether or not a ray intersects with any primitives?
Multi-hit TraversalYesNoCan you use the tree to find the list of primitives a ray intersects with?
Stream TraversalNo?YesDoes the tree provide accelerated support for intersections with groups of similar rays?
Stackless TraversalNoYes? (Skip-links)Is traversal done non-recursively, instead following a ray directly between adjacent nodes?
Ray-locality Opt.NoNoCan you traverse the tree without loading faraway nodes into memory?

CGAL's AABB-tree seems a lot more "vanilla" compared to Embree, and I think that's because a lot of Embree's optimizations are incompatible with CGAL's high level of genericity. For example, it might be impossible to generate morton codes for a lot of CGAL's kernels, and explicit data layout might not be useful for variable-precision types. CGAL naturally can't provide a lot of the same guarantees about the data being worked with, and that excludes certain optimizations.

I think Embree's Wide BVH implementation is a good example of how it takes advantage of SIMD. Each node of the tree has bounds defined using a custom vector type of size N (e.g. vfloat<N> lower_x). When N is less than the number of SIMD units, bounds checking can be done in a similar number of instructions to a conventional (two-child) tree! Notably, the underlying data structure of the vfloat<N> type is a simple C-style array. As I understand it, the team that develops Embree works closely with the Intel compiler team to make sure that their code can be compiled to packed SIMD instructions, rather than using intrinsics explicitly.

I was hoping Ayush Saraswat's benchmarks would be indicative of the performance gain SIMD can offer, but it seems like the two trees might just be too different to compare directly like that. Embree has some advantage because it doesn't need to guarantee exactness, and some because of its more elaborate architecture. In order to determine where SIMD is making the biggest difference, it's necessary to either quantify or eliminate these other advantages.

4.1.2 Prevalence of SIMD Instructions in Existing Code

Before attempting to introduce SIMD, an important question is how much the existing code already takes advantage of vectorized instructions. Modern compilers are very adept at vectorizing code that may not appear to use SIMD, by unrolling loops or recognizing blocks of code with implicitly parallel behavior.

The laptop I'm working on has an Intel i7-7700HQ CPU, which supports instruction sets up to SSE 4.2 and AVX 2. Depending on what compiler flags are set in CGAL's default release config, this means that the emitted assembly should at least contain some SSE instructions, and perhaps AVX instructions too (though in my experience, AVX must be explicitly enabled).

To determine this, I compiled the AABB-tree's Ray-shooting example in release mode, with no modifications to CGAL's default compiler flags. I counted packed SIMD instructions with the help of awk (piped to wc), a modification of the technique discussed here.

Instruction Set# of Instructions % of Program
All23,256100%
SSE4461.917%
SSE24011.724%
SSE300.00%
SSSE300.00%
SSE400.00%
AVX00.00%

To my surprise, the program contains no SIMD instructions beyond SSE2. It remains to be seen what the cause of this is, but this is actually very good news for this project: because the existing code isn't already taking full advantage of the CPU, I have a lot of performance headroom!

Further examination shows that CGAL's default release config for CMake on Fedora does not enable -march=native. This is because it's intended to be distributed as a binary (with broad hardware compatibility). Because CGAL is a header-only library, users don't need to accept the same compromise-- they can build with whatever flags they like.

4.2 May 18 - June 1

4.2.1 Examining Ray-BBox Intersection

According to included comments, CGAL's ray-bbox intersection algorithm is inspired by this paper. On Pierre's suggestion, I'm exploring how SIMD could be used to accelerate this operation.

NOTE: Technically this algorithm performs intersections between line segments and bounding boxes, but it can generalize to infinite rays. Both the algorithm and CGAL refer to it as Ray-BBox intersection, so I will remain consistent with that pattern.

First, I read through the discussed algorithm, and traced its operation on paper for a very simple example case.

Next, I put together a simplified version of the algorithm in the paper. This is somewhat closer to psuedocode than the paper, it's also written more verbosely, to help understand exactly what's happening.

class BBox {
    Vector3 bounds[2];
}
 
class Ray {
    Vector3 origin, direction, inv_direction;
    int sign[3]; // I'd like a more semantically appropriate way of representing this!
}
 
bool intersects(BBox box, Ray ray, float rmin, float rmax) {
 
    // Determine bounds for x and y
    float xmin = (box.bounds[ray.sign[0]].x - r.origin.x) * r.inv_direction.x;
    float xmax = (box.bounds[1 - ray.sign[0]].x - r.origin.x) * r.inv_direction.x;
    float ymin = (box.bounds[ray.sign[1]].y - r.origin.y) * r.inv_direction.y;
    float ymax = (box.bounds[1 - ray.sign[1]].y - r.origin.y) * r.inv_direction.y;
 
    // If the x and y bounds don't overlap, the ray doesn't intersect with the box
    if (xmin > ymax || ymin > xmax) return false;
 
    // Determine the bounds of the overlapping region
    min = MAX(xmin, ymin);
    max = MIN(xmax, ymax);
 
    // Determine bounds for z
    float zmin = (box.bounds[ray.sign[2]].z - r.origin.z) * r.inv_direction.z;
    float zmax = (box.bounds[1 - ray.sign[2]].z - r.origin.z) * r.inv_direction.z;
 
    // If the z bounds don't overlap with the existing region, the ray doesn't intercept
    if (zmin > max || min > zmax) return false;
 
    // Update the bounds to find the region where x, y, and z overlap
    min = MAX(min, zmin);
    max = MIN(max, zmax);
 
    // The ray intercepts if this region overlaps with the bounds provided
    return (min < rmax) && (max > rmin);
 
}

Once I had a starting point, it was possible to start thinking about how the code could be vectorized. Intel's official guide was a useful starting point when reasoning about what autovectorizer-friendly code looks like.

A more SIMD-optimal version might look something like this:

(Assignments that are grouped together are those that I think will be vectorized, e.g. all mins will be calculated as a group)

bool intersects(BBox box, Ray ray, float rmin, float rmax) {
 
    // Determine minimum bounds for x, y, and z
    float xmin = (box.bounds[ray.sign[0]].x - r.origin.x) * r.inv_direction.x;
    float ymin = (box.bounds[ray.sign[1]].y - r.origin.y) * r.inv_direction.y;
    float zmin = (box.bounds[ray.sign[2]].z - r.origin.z) * r.inv_direction.z;
 
    // Determine maximum bounds for x, y, and z
    float xmax = (box.bounds[1 - ray.sign[0]].x - r.origin.x) * r.inv_direction.x;
    float ymax = (box.bounds[1 - ray.sign[1]].y - r.origin.y) * r.inv_direction.y;
    float zmax = (box.bounds[1 - ray.sign[2]].z - r.origin.z) * r.inv_direction.z;
 
    // Determine the minimum bounds of the region where x, y, and z overlap
    min = MAX(xmin, ymin, zmin);
 
    // Determine the maximum bounds of the region where x, y, and z overlap
    max = MIN(xmax, ymax, zmax);
 
    // The ray intercepts if this region exists and overlaps with the bounds provided
    return (max > min) && (min < rmax) && (max > rmin);
 
}

Notice how we lose our early-exits: this function is now branch-free, for better or worse. This enables wider SIMD operations (finding all three mins at once rather than finding z separately), but it may mean that this has worse best-case performance.

The compiler may pad the vector values so that they contain four elements rather than three. If it's smart enough, this can enable the use of appropriate instructions to rapidly find the min and max values.

This code still finds the min and max values separately, but the compiler likely splits things up more to enable even wider SIMD operations. The following code might compile to exactly the same thing; because it's only a change to the order of operations, the compiler should be smart enough to make this optimization automatically:

(In this code I also explicitly acknowledge the padding values I might expect the vectorizer to add)

bool intersects(BBox box, Ray ray, float rmin, float rmax) {
 
    // Determine intermediate value for minimum bounds
    float xmin = box.bounds[ray.sign[0]].x;
    float ymin = box.bounds[ray.sign[1]].y;
    float zmin = box.bounds[ray.sign[2]].z;
    /* pad_min = FLOAT_MAX; */
 
    // Determine intermediate value for maximum bounds
    float xmax = box.bounds[1 - ray.sign[0]].x;
    float ymax = box.bounds[1 - ray.sign[1]].y;
    float zmax = box.bounds[1 - ray.sign[2]].z;
    /* pad_max = FLOAT_MIN; */
 
    // Apply transform to all bounds
    float xmin = (xmin - r.origin.x) * r.inv_direction.x;
    float ymin = (ymin - r.origin.y) * r.inv_direction.y;
    float zmin = (zmin - r.origin.z) * r.inv_direction.z;
    /* pad_min = (pading - 0.0)      * 1.0; /*
    float xmax = (zmax - r.origin.x) * r.inv_direction.x;
    float ymax = (ymax - r.origin.y) * r.inv_direction.y;
    float zmax = (zmax - r.origin.z) * r.inv_direction.z;
    /* pad_max = (pading - 0.0)      * 1.0; /*
 
    // Determine the minimum bounds of the region where x, y, and z overlap
    min = MAX(xmin, ymin, zmin /*, pad_min */);
 
    // Determine the maximum bounds of the region where x, y, and z overlap
    max = MIN(xmax, ymax, zmax /*, pad_max */);
 
    // The ray intercepts if this region exists and overlaps with the bounds provided
    return (max > min) && (min < rmax) && (max > rmin);
 
}

This arrangement is mathematically equivalent, and the compiler may choose to do something like this if it determines that it results in better performance (though I'm not certain that's the case!).

Note: These comments are primarily speculative, based on my understanding of intel's documentation.

4.2.2 Choosing a SIMD Approach

Note: ISPC is discussed in this fantastic blog

SIMD Approaches in order of Explicitness
Most Readable
Compiler autovectorization
OpenMP (#pragma omp simd)
High level libraries (e.g. Eigen)
Low level libraries (wrappers over intrinsics)
SIMD intrinsics
Embedded assembly
Most Explicit
Advantages and Disadvantages of Different SIMD Approaches
ApproachProsCons
Compiler autovectorization
  • Doesn't require additional semantics, meaning that readability of code may be less impacted
  • No libraries or additional dependencies needed
  • Compiler may find opportunities for optimization that would otherwise go undetected
  • Results in "invisible compiler magic": code that is arranged in a specific way for no clear reason
  • Different optimizations will be found by different compilers, meaning performance might be less consistent between users
OpenMP (#pragma omp simd)
  • Greater level of control, more confidence that SIMD will actually be used where we think it's important
  • Adds clarity where SIMD techniques are being used (unlikely to be removed accidentally during refactoring)
  • Only applicable a limited number of constructs (e.g. for loops)
  • Adding a #pragma doesn't always result in additional vectorization (the compiler already does pretty well without hints)
High level libraries (e.g. Eigen)
  • Already thoroughly optimized to make good use of SIMD
  • Probably not usable within CGAL
Low level libraries (wrappers over intrinsics)
  • Very high level of control over how and when SIMD is applied
  • Where library constructs are used, there is a strong guarantee that SIMD instructions will actually be emitted
  • Allow for refactoring with full confidence: as long as no library constructs are removed, the code can be transformed radically without reducing the use of SIMD
  • Puts a lot of responsibility on the engineer; it only makes sense to work this low-level if we know we can write better SIMD code than the compiler!
  • By writing explicit SIMD code, we risk cutting off implicit avenues of optimization for the compiler.
  • Only gives access to "lowest common denominator" SIMD instructions, some niche instructions only available on certain architectures might be useful, but won't be available
SIMD intrinsics
  • Even higher level of control over how and when SIMD is applied
  • Allows us to use niche intrinsics which are useful for our situation
  • Requires careful handling for cross-platform support
  • Puts even more responsibility on the engineer; it only makes sense to work this low-level if we know we can write better SIMD code than the compiler!
  • By writing explicit SIMD code, we risk cutting off implicit avenues of optimization for the compiler
Embedded assembly
  • Highest possible level of control over how and when SIMD is applied
  • Allows us to use niche intrinsics which are useful for our situation
  • Requires careful handling for cross-platform support
  • Puts all responsibility on the engineer; it only makes sense to work this low-level if we know we can write better SIMD code than the compiler!
  • By saying exactly what assembly we want emitted, we cut off all avenues of optimization by the compiler

4.2.2.1 Libraries

LibraryLicenseNotes
Boost.SIMDClosedOriginally planned for inclusion in Boost, sadly that was cancelled and the library was made closed source.
nsimdMITFork of Boost.SIMD
xsimdBSD 3-clauseReimplementation of Boost.SIMD's API
VectorclassGPLv3Library by Agner Fog
libsimdppBoostProvides Dynamic Dispatch
std-simdBSD 3-clausestd::experimental::simd, might someday be included in libstdc++!
dimsumApache 2.0Google's own implementation of the same standard simd proposal.
SIMD EverywhereMITEnables the use of any flavor of intrinsics, on any platform (automatically substitutes unsupported intrinsics)
Advantages and Disadvantages of Different SIMD Libraries
LibraryProsCons
nsimd
  • Boost-style API
  • Vector length agnostic
  • Adds support for SPMD (shader-style) programming
  • Compatible with all major instruction sets
  • Low visible activity, most recent PR merged 3 months ago
  • Not especially well documented
xsimd
  • Boost-style API
  • Compatible with all major instruction sets
  • Very active (last PR merged 1 hour ago)
  • Well documented, including a doxygen manual
  • Doesn't automatically fallback to multiple smaller instructions when vector size is too large for the platform
Vectorclass
  • Stable, library has been around nearly as long as SIMD itself
  • Very well documented, including a guide with examples
  • Compatible with all major instruction sets
  • Relatively active (last changed 15 days ago)
  • Documentation is in the form of a PDF
  • Primarily developed by one person only
  • Doesn't have a well-defined VCS workflow
libsimdcpp
  • Compatible with all major instruction sets, as well as some niche ones
  • Has support for dynamic dispatch
  • Not vector length agnostic
  • Little recent activity, most recent PR merged nearly 2 years ago
std-simd
  • Targeted for inclusion in the standard template library
  • Somewhat documented; already has a page on ccpreference.com
  • Because the API is an ISO standard, stability is guaranteed and experience is transferable to other libraries which implement the standard
  • Compatible with all major instruction sets
  • Vector length agnostic
  • Low visible activity, most recent PR merged 4 months ago
  • Documentation is incomplete: there are no examples, and many functions are left undocumented
dimsum
  • Alternate implementation of the std-simd API
  • Developed at Google
  • Vector length agnostic
  • Not a complete implementation (missing certain features)
  • Currently only supports certain instruction sets
SIMD Everywhere
  • Different from other libraries: rather than providing a layer over SIMD intrinsics for different instruction sets, enables use of intrinsics for any platform on any other platform.
  • Compatible with Marc Glisse's existing (intrinsic-based) work, and may even allow removing the macros used for checking platform compatibility!
  • Allows access to more "niche" intrinsics, which are replaced with equivalent code on unsupported platforms
  • Very active (last PR merged 10 hours ago)
  • Requires direct use of intrinsics, foregoing the more ergonomic API provided by other libraries

In my opinion, the list can be reduced to the following libraries, each with their own advantages:

  • xsimd: the most actively developed
  • libsimdcpp: might be the only library with macros to help with dynamic dispatch
  • std-simd: the API is an ISO standard, and this may be included in a future version of C++
  • SIMD everywhere: very different from the rest, can add cross-platform support so Marc Glisse's existing code (but provides no convenience layer over intrinsics)

4.2.3 Automated SIMD Prevalence Analysis

The earlier process of determining how many SIMD instructions were present in a binary file was relatively time-consuming. Because this is something I expect to do frequently in the course of the project, I decided to automate it. I created a simple bash script which counts the instructions as I did manually, and also calculates their percents to produce a table like the one shown earlier.

This work is available on a new repository, which I intend to use for future such experimentation.

4.3 June 2 - June 7

4.3.1 Ray-BBox Intersection Codes

In order to see how different compiler flags, optimizations, and SIMD libraries effect the produced binary and its performance I'm using a minimal test case. I assembled a collection of ray-bbox intersection tests for spartan implementations of Vector3, BBox, and Ray types. These are inspired by the paper discussed earlier, with various modifications added.

Ray-BBox Intersection Implementations
NameCodeDescription
Smits' Method
bool intersect_smits_method(const BBox &bbox, const Ray &ray, float t0, float t1) {
    double tmin, tmax, tymin, tymax, tzmin, tzmax;
 
    if (ray.direction().x() >= 0) {
        tmin = (bbox.bounds()[0].x() - ray.origin().x()) / ray.direction().x();
        tmax = (bbox.bounds()[1].x() - ray.origin().x()) / ray.direction().x();
    } else {
        tmin = (bbox.bounds()[1].x() - ray.origin().x()) / ray.direction().x();
        tmax = (bbox.bounds()[0].x() - ray.origin().x()) / ray.direction().x();
    }
 
    if (ray.direction().y() >= 0) {
        tymin = (bbox.bounds()[0].y() - ray.origin().y()) / ray.direction().y();
        tymax = (bbox.bounds()[1].y() - ray.origin().y()) / ray.direction().y();
    } else {
        tymin = (bbox.bounds()[1].y() - ray.origin().y()) / ray.direction().y();
        tymax = (bbox.bounds()[0].y() - ray.origin().y()) / ray.direction().y();
    }
 
    if ((tmin > tymax) || (tymin > tmax)) return false;
    if (tymin > tmin) tmin = tymin;
    if (tymax < tmax) tmax = tymax;
 
    if (ray.direction().z() >= 0) {
        tzmin = (bbox.bounds()[0].z() - ray.origin().z()) / ray.direction().z();
        tzmax = (bbox.bounds()[1].z() - ray.origin().z()) / ray.direction().z();
    } else {
        tzmin = (bbox.bounds()[1].z() - ray.origin().z()) / ray.direction().z();
        tzmax = (bbox.bounds()[0].z() - ray.origin().z()) / ray.direction().z();
    }
 
    if ((tmin > tzmax) || (tzmin > tmax)) return false;
    if (tzmin > tmin) tmin = tzmin;
    if (tzmax < tmax) tmax = tzmax;
 
    return (tmin < t1) && (tmax > t0);
}

Included for completion's sake, this is a relatively naive approach discussed in the paper. The ray's inverse direction and sign are not pre-computed, so this should be expected to under-perform all other methods.

Improved
bool intersect_improved(const BBox &bbox, const Ray &ray, float t0, float t1) {
 
    double tmin, tmax, tymin, tymax, tzmin, tzmax;
 
    tmin = (bbox.bounds()[ray.sign()[0]].x() - ray.origin().x()) * ray.inv_direction().x();
    tmax = (bbox.bounds()[1 - ray.sign()[0]].x() - ray.origin().x()) * ray.inv_direction().x();
 
    tymin = (bbox.bounds()[ray.sign()[1]].y() - ray.origin().y()) * ray.inv_direction().y();
    tymax = (bbox.bounds()[1 - ray.sign()[1]].y() - ray.origin().y()) * ray.inv_direction().y();
 
    if ((tmin > tymax) || (tymin > tmax)) return false;
    if (tymin > tmin) tmin = tymin;
    if (tymax < tmax) tmax = tymax;
 
    tzmin = (bbox.bounds()[ray.sign()[2]].z() - ray.origin().z()) * ray.inv_direction().z();
    tzmax = (bbox.bounds()[1 - ray.sign()[2]].z() - ray.origin().z()) * ray.inv_direction().z();
 
    if ((tmin > tzmax) || (tzmin > tmax)) return false;
    if (tzmin > tmin) tmin = tzmin;
    if (tzmax < tmax) tmax = tzmax;
 
    return (tmin < t1) && (tmax > t0);
 
}

This is the modified version of smits' method proposed by the paper, which uses the pre-computed values for better performance. This approach will serve as the overall baseline.

Clarified
bool intersect_clarified(const BBox &bbox, const Ray &ray, float rmin, float rmax) {
 
    // Determine bounds for x and y
    double xmin = (bbox.bounds()[ray.sign()[0]].x() - ray.origin().x()) * ray.inv_direction().x();
    double xmax = (bbox.bounds()[1 - ray.sign()[0]].x() - ray.origin().x()) * ray.inv_direction().x();
 
    double ymin = (bbox.bounds()[ray.sign()[1]].y() - ray.origin().y()) * ray.inv_direction().y();
    double ymax = (bbox.bounds()[1 - ray.sign()[1]].y() - ray.origin().y()) * ray.inv_direction().y();
 
    // If the x and y bounds don't overlap, the ray doesn't intersect with the box
    if (xmin > ymax || ymin > xmax) return false;
 
    // Determine the bounds of the overlapping region
    double min = std::max(xmin, ymin);
    double max = std::min(xmax, ymax);
 
    // Determine bounds for z
    double zmin = (bbox.bounds()[ray.sign()[2]].z() - ray.origin().z()) * ray.inv_direction().z();
    double zmax = (bbox.bounds()[1 - ray.sign()[2]].z() - ray.origin().z()) * ray.inv_direction().z();
 
    // If the z bounds don't overlap with the existing region, the ray doesn't intercept
    if (min > zmax || zmin > max) return false;
 
    // Update the bounds to find the region where x, y, and z overlap
    min = std::max(min, zmin);
    max = std::min(max, zmax);
 
    // The ray intercepts if this region overlaps with the bounds provided
    return (min < rmax) && (max > rmin);
}

This is my own interpretation of the paper's approach, modified for clarity. I expect the use of semantic constructs like std::max to have no affect on performance, with identical binary being emitted by the compiler.

Branchless
bool intersect_branchless(const BBox &bbox, const Ray &ray, float rmin, float rmax) {
 
    // Determine bounds x, y, and z
    double xmin = (bbox.bounds()[ray.sign()[0]].x() - ray.origin().x()) * ray.inv_direction().x();
    double xmax = (bbox.bounds()[1 - ray.sign()[0]].x() - ray.origin().x()) * ray.inv_direction().x();
    double ymin = (bbox.bounds()[ray.sign()[1]].y() - ray.origin().y()) * ray.inv_direction().y();
    double ymax = (bbox.bounds()[1 - ray.sign()[1]].y() - ray.origin().y()) * ray.inv_direction().y();
    double zmin = (bbox.bounds()[ray.sign()[2]].z() - ray.origin().z()) * ray.inv_direction().z();
    double zmax = (bbox.bounds()[1 - ray.sign()[2]].z() - ray.origin().z()) * ray.inv_direction().z();
 
    // Determine the bounds of the overlapping region
    double min = std::max({xmin, ymin, zmin});
    double max = std::min({xmax, ymax, zmax});
 
    // The ray intercepts if this region overlaps with the interval provided
    return (max > min) && (min < rmax) && (max > rmin);
}

This is the first major modification aiming to improve SIMD performance. Removing the conditionals gives us straight-line code, but eliminates opportunities for early exits. This code should be more readily auto-vectorized by the compiler, but only benchmarks can tell whether that advantage is enough to overcome the lack of early exits in real-world scenarios.

As I add more implementations, they will be added to this table.

4.3.2 SIMD Prevalence for Different Implementations

For analysis, the prevalence test was restricted to only the relevant function (otherwise the testing framework and its use of the standard library can overwhelm the relevant statistics).

This proved to be the right choice: for example the smits' method program contained 1,957 instructions, but only 67 of those make up the intersection function itself!

4.3.2.1 Smits' Method

SIMD Instruction Prevalence (-O3)
Instruction SetCountPercent
All67100.00
SSE11.49
SSE (Packed)11.49
SSE23247.76
SSE2 (Packed)57.46
SSE300
SSSE300
SSE400
AVX00

This already contains a lot of SIMD instructions, but none above SSE2. Note that non-packed SIMD instructions operate on single values, and not on the entire SSE register. This means that they don't actually introduce parallelism to the code in the way packed instructions do.

SIMD Instruction Prevalence (-O3 -march=native)
Instruction SetCountPercent
All65100.00
SSE00
SSE (Packed)00
SSE200
SSE2 (Packed)00
SSE300
SSSE300
SSE400
AVX710.76

Once the compiler is allowed to use newer instructions, it replaces all the SSE and SSE2 instructions with AVX. Though this looks like lower use of SIMD, only 6 of the SSE instructions were packed, while all 7 AVX instructions are packed. The non-packed SSE instructions are only relevant when data is placed in SSE registers. With AVX the same tasks are accomplished with conventional instructions.

Note that in general, it wouldn't be unreasonable to see fewer total SIMD instructions when AVX is enabled. Because modern AVX instructions can operate on larger sets of values at a time, they should be able to accomplish the same tasks using fewer instructions.

4.3.2.2 Improved

SIMD Instruction Prevalence (-O3)
Instruction SetCountPercent
All71100.00
SSE11.40
SSE (Packed)11.40
SSE22636.61
SSE2 (Packed)22.81
SSE300
SSSE300
SSE400
AVX00

The improved version actually contains more instructions, despite its use of precomputed values. It also contains fewer packed SIMD instructions, perhaps due to additional logic precluding their use in certain cases.

When analyzing these results, it's important to be aware of two things:

  • The improved version accounts for an edge case not caught by the original.
  • More instructions does not always mean slower code (not all instructions are created equal).
SIMD Instruction Prevalence (-O3 -march=native)
Instruction SetCountPercent
All69100.00
SSE00
SSE (Packed)00
SSE200
SSE2 (Packed)00
SSE300
SSSE300
SSE400
AVX1217.39

Enabling AVX seems to help the improved version. Though it still has more instructions than the original, more of those instructions are SIMD.

Benchmarks will tell what effect this has on performance.

4.3.2.3 Clarified

SIMD Instruction Prevalence (-O3)
Instruction SetCountPercent
All67100.00
SSE11.49
SSE (Packed)11.49
SSE22435.82
SSE2 (Packed)00
SSE300
SSSE300
SSE400
AVX00

This result is actually unexpected: my version of the improved algorithm was tweaked for readability, and I expected it to result in identical assembly being emitted. The new code compiles to fewer instructions, but also makes less use of SIMD. In order to see where the differences lie, I'll need take a closer look at the assembly. My first guess is that it comes from the additional variables, or my use of std::max and std::min.

SIMD Instruction Prevalence (-O3 -march=native)
Instruction SetCountPercent
All69100.00
SSE00
SSE (Packed)00
SSE200
SSE2 (Packed)00
SSE300
SSSE300
SSE400
AVX1217.39

After AVX instructions, things look more how I expected. The function appears to compiler to the same assembly as its more verbose predecessor.

What remains a mystery is why enabling AVX would cause this, as I understand, the -march=native flag only makes new instructions available to the compiler, it shouldn't add any new paths for optimization!

4.3.2.4 Branchless

SIMD Instruction Prevalence (-O3)
Instruction SetCountPercent
All64100.00
SSE11.56
SSE (Packed)11.56
SSE22132.81
SSE2 (Packed)00
SSE300
SSSE300
SSE400
AVX00

Similar to the clarified code, this function only contains a single packed SSE function when compiled without -march=native! The only meaningful change made was to remove the conditionals (none of the math was altered), my hope was that that would unlock more optimizations by the autovectorizer, but clearly that's not the case. There are fewer total instructions, but that's to be expected as a result of the reduction in logic.

This doesn't necessarily mean that there's no advantage to the branch-free approach, but any measured advantage would likely be due to reduced branch-mispredictions, and not because of increased SIMD utilization.

SIMD Instruction Prevalence (-O3 -march=native)
Instruction SetCountPercent
All64100.00
SSE00
SSE (Packed)00
SSE200
SSE2 (Packed)00
SSE300
SSSE300
SSE400
AVX1218.75

Based on the previous table, these results aren't surprising: the branchless function compiles to very similar assembly vs. the other two equivalent functions, minus some conditional logic!

4.4 June 8 - June 15

4.4.1 Collecting Test Data for Benchmarking

Andreas, Pierre, and I discussed the challenges of producing realistic test data when evaluating the performance of an like the intersection function. They proposed extracting scenarios from the tetrahedral remeshing tests, because it's an example of a real package which heavily uses the AABB-tree to shoot rays. This is actually relatively simple to do, by inserting a print statement like the following in CGAL's ray-bbox intersection code:

if (std::is_same<FT, double>::value)
  std::cout << px << " " << py << " " << pz << " "
            << qx << " " << qy << " " << qz << " "
            << bxmin << " " << bymin << " " << bzmin << " "
            << bxmax << " " << bymax << " " << bzmax << " "
            << "\n";

(only double types are saved because those are relevant the benchmark)

This is an easy and effective way to capture the real ways that the function is used. Running test_mesh_and_remesh_polyhedral_domain_with_features.cpp saved 3,742,217 examples, producing a 376.5 MB text file! The format isn't particularly pretty or readable, but it's usable enough for parsing back into my simplified Ray and BBox types.

You may call std::cout.precision(17) in case that was not called earlier. Andreas Fabri 08:21, 10 June 2021 (CEST)

  • Thanks for the advice, enabling full precision data produced a slightly different intersection frequency! Jackson Campolattaro

4.4.2 Initial Benchmark Results

I created a simple benchmarking program with a function for timing lambda functions using std::chrono. Because I'm running this on a laptop, it was important to ensure that later tests wouldn't have a disadvantage because of rising CPU temperatures. To accomplish this, I interleaved the runs of different functions: I time the process of calculating N scenarios using each implementation, and repeat that R times, results are averaged. I found that even N=3742217 (my entire dataset) could be processed in a reasonable amount of time. For reducing time variance, I set R=100; this gave me a total runtime slightly over 30s.

(Unless otherwise indicated, times correspond to a single run through all scenarios)

Time to Complete 3,742,217 Intersection Tests
Implementations-O3-O3 -march=nativeAverage
Smits' Method5.12781e+07 ms5.07237e+07 ms5.10E+07 ms
Improved4.50296e+07 ms4.34125e+07 ms4.42E+07 ms
Clarified4.53892e+07 ms4.30072e+07 ms4.42E+07 ms
Branchless4.74904e+07 ms4.5071e+07 ms4.63E+07 ms
Average4.73E+07 ms4.56E+07 ms4.64E+07 ms

The averages along the bottom are useful for seeing how compiler flags broadly affect performance, while the averages on the right side are useful for an idea of how each implementation performs, irrespective of flags.

These benchmark results tell us a few things:

  • -march=native confers a 3-5% improvement in performance (for this task).
  • The clarified code has similar performance to the (functionally equivalent) improved code, despite its different assembly.
  • The branchless code slightly underperforms the improved and clarified versions, indicating that the early exits that those provide are worthwhile for real data. (The fact that this is so close means that I could have easily gotten the wrong impression, had I used the wrong type of synthetic data!)

Based on lessons learned here, I intend to base future implementations off of the clarified version.

4.4.3 Tweaks to Benchmarks

During our Friday meeting, we discussed ways to open up more opportunities for SIMD parallelism in the benchmarks. One idea was to take into account the fact that each ray is tested against many bounding boxes. Previously our input data contained scenarios that matched one ray against one box, I changed this so that each unique ray is paired with the collection of boxes that it's tested against. Putting the boxes in groups like this may make it possible to perform vectorizations on a larger scale, looking not just at the relationship between one ray and one box, but at how one ray relates to a large set of boxes.

This doesn't map as closely to existing real-life use of the intersection function (bounding boxes are not guaranteed to be adjacent in memory), but it does do a good job of reflecting an ideal scenario, which we should strive towards in order to make the best use of SIMD.

Time to Complete 3,742,217 Intersection Tests (Split into independent Queries)
Implementations-O3-O3 -march=nativeAverage
Smits' Method4.96E+07 ms5.06E+07 ms5.01E+07 ms
Improved2.85E+07 ms2.62E+07 ms2.73E+07 ms
Clarified2.76E+07 ms2.65E+07 ms2.71E+07 ms
Branchless3.28E+07 ms3.28E+07 ms3.28E+07 ms
Average3.46E+07 ms3.40E+07 ms3.43E+07 ms


This simple change produced a massive improvement in performance! Accross the board, times are cut by more than 25%. This is a rather spectacular difference, considering that no change was made to the functions themselves.

4.4.4 Intersection using XSimd

I added another intersection test that uses XSimd to implement the branchless version with more explicit use of SIMD. I chose XSimd because of its better usage semantics than std-simd, but I was disappointed to find that its types don't generalize as nicely for different platforms. For example, the xs::batch<double, 4> type is only available when -march=nativeis used, and only on systems that provide registers of that size Unlike std-simd, no fallback is provided in other cases (the code will simply fail to compile).

The code works properly, but doesn't perform well. For the standard benchmark, its time averages around 3.99e+07 ms; better than the naive version, but worse than all the others. It has the same number of AVX instructions as the auto-vectorized branchless version, but it also has a lot more instructions in general.

It's certainly possible to beat the compiler by using a library like this, but I get the impression that I don't (yet) have the skill to pull that off.

XSimd in particular seems poorly suited to this type of fine-grained optimization (parallelism between the x, y, z values of a vector, in this case). I get the impression that it's designed for the other type of vectorization (parallelism between multiple vectors).

NOTE: Future benchmarks will be compiled with -march=native unless otherwise indicated, because XSimd only supports 4-way batches of doubles when they're supported in hardware.

4.4.5 Switching from Line Segments to Rays

The intersection methods I've been working with take bounds arguments t0 and t1, meaning that they are actually testing for intersections with line segments, and not rays. In all previous benchmarks, I used values of +/- infinity to generalize to infinite rays. For future tests I've removed those bounds checks, meaning that the code is now specialized to infinite rays. Eliminating the logic came with a corresponding performance advantage.


Time to Complete 3,742,217 Intersection Tests
ImplementationTime
Smits' Method4.3852e+07 ms
Improved1.98313e+07 ms
Clarified1.97305e+07 ms
Branchless2.1348e+07 ms
XSimd2.59743e+07 ms

4.4.6 Effects of inline

Marking a function inline can increase binary sizes, but it has the benefit of putting each invocation of a function in context. This is especially important for autovectorization, because it gives the compiler the ability to recognize natural parallelism on a broader scale. Before, vectorization could only happen within the function, exploiting the paralellism of the x, y, and z coordinates. Puting the inline tag on the functions being tested lets the compiler make optimizations that span across function calls, this means that it can exploit the parallelism of the outer loop, using SIMD to check a ray against multiple boxes at once.

I was disappointed by the speed of the branchless implementation in earlier tests, but this is where its advantages shine through. Early exits helped performance before, but -- critically -- they hinder this type of higher-level vectorization. When a function uses different instructions depending on the data its applied to, it naturally can't be applied to batches of data. The branchless version has no such issue, so more coarse-grained optimizations can be applied.

Time to Complete 3,742,217 Intersection Tests
ImplementationTime
Smits' Method4.36403e+07 ms
Improved2.01639e+07 ms
Clarified1.97437e+07 ms
Branchless1.3715e+07 ms
XSimd1.8891e+07 ms

These results leave the question: is this representative of real world usage?

4.5 June 16 - June 22

4.5.1 Restructuring Benchmarks

In preparation for creating a new data arrangement, I spent some time cleaning up the simd experimentation code. The main changes included:

  • More narrowly defined responsibilities for the Vector3 and BBox classes.
  • Improved deserialization code for loading data from files (including istream operators for each of the custom types).
  • Namespaces for each intersection implementation
  • Results are formatted as a WikiText-syntax table, so that I can paste directly into the wiki.

4.5.2 Validating benchmarks for Correctness

My original approach for catching logic errors in the different implementations was to watch the hit-rate of their intersection tests; this was done by counting the intersections in each benchmark. Unfortunately, this doesn't catch "symmetrical" errors, where a logic mistake will result in the same number of spurious intersections as missed ones. To resolve this, I began saving results into an std::vector<bool>, to be compared at the end of the benchmark. I actually caught more than one mistake when I added this, but nothing that required additional complexity to fix (the mistakes didn't have performance implications).

4.5.2.1 Performance Implications

Unfortunately, adding this validation logic does come with a cost. Adding each result to the end of the std::vector consumes some time, and ultimately this produces significantly slower benchmarks.

Time to Complete 3742217 Intersection Tests
ImplementationTime
Smits' Method4.95657e+07 ms
Improved3.43701e+07 ms
Clarified3.43883e+07 ms
Branchless3.36523e+07 ms
XSimd4.16125e+07 ms

Luckily because each benchmark is adding the same number of results to the array, the effect is "flat". The relative performance of the different methods is left unchanged, and so the same conclusions can be reached.

4.5.3 "Struct-of-Arrays" BBox type

A common approach to improve a program's use of SIMD is to arrange the data in a "Struct-of-Arrays" rather than an "Array-of-Structs" format. This means that data can be collected into SIMD registers using aligned load operations, rather than a "gather".

For example, a collection of 8 (x, y, z) vectors stored in the AoS format would look like ...xyzxyzxyzxyzxyzxyzxyzxyz... in memory; this makes it easy to load an entire vector at once, but not to load a group of x values as a batch. On the other hand, the same data in SoA format would look more like xxxxxxxx...yyyyyyyy...zzzzzzzz. This makes it easier to operate on groups of x values using simd registers, but it has the downside that the data associated with a specific vector is scattered.

I am optimistic that an SoA packing strategy could be used to improve performance on this benchmark, but it's notably less straightforward to apply to CGAL than the AoS strategy already being used. Arranging data in this way is an optimization that punctures the veil of many high-level abstractions.

The use-case I've been exploring for the intersection function compares each ray to a large number of bounding boxes. Because of this, we should be able to apply an SoA philosophy to the boxes only, while keeping the rays AoS.

4.5.3.1 VBBox

My "Vector-Bounding Box" type extends the BBox<T> type, where T is an std::vector.

The VBBox has a constructor which takes a vector of scalar bounding boxes and packs their underlying data into vectors. The files containing test data are naturally in Array-of-Structs form, so loading them directly into the new data structures would be awkward. Instead, I created a function which produces SoA data from already loaded SoA data.

It also adds an access method getr(std::size_t i) which retrieves a "cross-sectional view" of the data structure. This is another bounding box, where T is a const T&. Because I'm using STL data structures to hold T inside my classes, I substituted std::reference_wrapper<const T> for const T&. With the help of this accessor, I can perform intersection tests on individual boxes in the VBBox without any changes to the scalar intersection code.

4.5.3.2 Benchmark Results

Time to Complete 3742217 Intersection Tests
ImplementationTime
Explicit SIMD4.53096e+07 ms
Implicit SIMD4.74185e+07 ms

The fact that the explicit version of the algorithm outperformed the implicit one surprised me. Though the difference is small, it seems like by manually adding simd instructions I found some avenues for optimization that the compiler could not.

Sadly, I was unimpressed with the performance of both AoS methods; both perform worse than the SoA version, while using an equivalent branchless algorithm. My suspicion is that the new arrangement in memory hurts cache performance, which becomes relevant for a computationally-light task like intersection tests.

4.5.3.3 Relevance to Real-life use

Could we use this strategy in CGAL? Knowing that this approach underperforms perhaps makes this question less relevant, but it's still worth looking at.

I believe this isn't actually an option, in order to pack our bounding boxes, we would need to know the relevant ones (the ones we'll be checking for intersections) beforehand. This is easy for some applications (particularly brute force), but to do this in the course of traversing a tree would be tantamount to seeing the future.

4.5.4 Caching Broadcasted Ray

Because our intersection compares one ray to several boxes, there are many places in the code where it's necessary to apply a scalar operation to many elements, for example [1.0, 2.0, 3.5, 2.0] * 4.0 = [4.0, 8.0, 14.0, 8.0]. In python/numpy terms, this type of elementwise math is known as a "broadcasted" operation. xsimd provides overloaded operators that make this easy to do, and that's how I created the original explicit intersection function.

The catch, here, is that AVX, SSE, et al don't actually support broadcasted operations! The CPU can only perform operations between simd registers. As a result, any time we want to do a broadcast operation we also need to do a "broadcast load", which puts the same value in each slot of the simd register. This is repeated every time the intersection is invoked -- even if the ray is the same -- and the optimizer doesn't appear to notice the repeated loads.

We can fix this by performing the broadcast in the outer loop, and then using that cached result for our intersections. Whenever we start performing intersections with a new ray, we (manually) broadcast the ray's parameters to xsimd registers. These broadcasted values are used for all the relevant intersections.

4.5.4.1 Benchmark Results

Time to Complete 3742217 Intersection Tests
ImplementationTime
Explicit SIMD4.35466e+07 ms
Implicit SIMD4.61569e+07 ms

This change appears to have produced a modest performance improvement vs performing the broadcast each time. The explicit code now outperforms the implicit code by a slightly larger margin.

4.6 June 23 - June 30

4.6.1 "Array of Structs of Arrays" BBox

In an "AoSoA" arrangement, the values are grouped together into batches, each batch the size of the CPU's SIMD registers. On a CPU with 4-way SIMD, data would be arranged like ...xxxxyyyyzzzzxxxxyyyyzzzzxxxxyyyyzzzz... in memory.

This strategy has several advantages:

  • Data does not need to be loaded from scattered locations when filling a single SIMD register (unlike the AoS strategy).
  • Data does not need to be loaded from faraway locations in memory when loading relevant variables into different registers (unlike the SoA strategy, where each x, y, z set can be very far away in memory).
  • This model actually lines up closely with the way some proposed improvements to the AABB-tree would arrange the bounding boxes in memory.

It also comes with its own challenges:

  • The number of items may not be aligned with the chosen batch size, (e.g. 47 elements won't cleanly fit into a AoSoA data structure with batch size 4). In this case, it's necessary to retain an additional AoS data structure to hold the leftovers.

So what does it look like to implement this for my own BBox type? Because of my use of templates, the relevant data structure is almost as simple as a std::vector<BBox<xsimd::batch<double, 4>>> coupled with an std::vector<BBox<double>> to hold the leftovers. Most of the challenges in implementation actually came from loading the data structure during parsing, and I ended up creating a XBBox type (which extends the BBox<xsimd::batch<>>>` type) to help with conversion from scalar boxes. Once the data was packed properly, it became simple to use intersection methods I had previously written to process the structure (using explicit or implicit SIMD).

4.6.1.1 Benchmark Results

Time to Complete 3742217 Intersection Tests (AoSoA Data)
ImplementationTime
Explicit SIMD3.61035e+07 ms
Implicit SIMD3.50253e+07 ms

Benchmarking shows that this strategy performs better than the SoA model, but doesn't quite beat the original AoS approach.

4.7 July 1 - July 8

4.7.1 Performing Benchmarks on Deepsat

Deepsat is a Xeon server that I and other members of the Inria TITANE team share access to. After arriving at the Inria campus, I decided to re-run my benchmarks on the machine as a way to build familiarity with the workflow here. Most of the server's advantages aren't relevant to my test case: the 20 CPU cores don't benefit my single-threaded program, and the 132Gb of ram aren't necessary for my 900Mb dataset. What's important to me is the consistency the machine can provide. Whereas my laptop is thermally limited, the server has the ability to run benchmark after benchmark with no decrease to its performance!

I connected to Deepsat over SSH, and recompiled the solutions to take full advantage of the modern Xeon CPU's instruction set. Because my benchmarks contain their own timing and result-reporting logic, producing the following tables was as simple as running each in turn.

4.7.1.1 Results

Time to Complete 3742217 Intersection Tests on AoS Data
ImplementationTime
Smits' Method6.38517e+07 ms
Improved3.70988e+07 ms
Clarified3.65901e+07 ms
Branchless3.68068e+07 ms
XSimd4.14903e+07 ms
Time to Complete 3742217 Intersection Tests on SoA Data
ImplementationTime
Explicit SIMD5.93713e+07 ms
Implicit SIMD4.44437e+07 ms
Time to Complete 3742217 Intersection Tests on AoSoA Data
ImplementationTime
Explicit SIMD3.96977e+07 ms
Implicit SIMD3.93736e+07 ms

Performance was slightly worse across the board, but this was expected; Deepsat's Xeon CPU excels at embarrassingly parallel workloads processing massive datasets, and this benchmark is neither of those things. Consumer hardware is optimized for single-core speed, and that's an advantage for this type of work.

Despite its differences, this data leads us to the same conclusions as previous results.

4.7.2 Experimenting with Application of OpenMP Directives

One approach for SIMD that we haven't discussed so far is OpenMP. This is because I've been having trouble getting interesting results from it.

Simply put, I've yet to run into a case where adding a #pragma omp ... tag resulted in the compiler making any different choices. Adding it to the benchmark code didn't affect performance, so to confirm my suspicions I created a simple program similar to IBM's example of how to apply OpenMP for SIMD.

// Naive function
float func_naive(float a, float b) {
    return std::max(a, b);
}
 
// Openmp-vectorized function
#pragma omp declare simd
float func_openmp(float a, float b) {
    return std::max(a, b);
}

These implementations give us the following results:

Naive: 0.677483s
Openmp:  0.554589s

The OpenMP tag does give us some improvement, but the difference is marginal (and I haven't yet confirmed how much of that comes from not interleaving the benchmarks).

// Naive function
inline float func_naive(float a, float b) {
    return std::max(a, b);
}
 
// Openmp-vectorized function
#pragma omp declare simd
inline float func_openmp(float a, float b) {
    return std::max(a, b);
}
Naive: 0.730367
Openmp:  0.571762

Inlining both functions doesn't result in significant changes to performance; they're small enough that they were likely already effectively inlined by the compiler.

4.7.3 Profiling Mesh_3 remesh_polyhedral_surface_sm

On Andreas' suggestion, I will be using Mesh_3's remesh_polyhedral_surface_sm as a representation of typical usage of the AABB-tree in future tests.

I started by using the profiling tool Perf to identify "hot points" in the program. This is important, because it helps us determine where our efforts are best applied if we want to improve performance. Up until now, I've been using ray-bbox intersection as a target for optimization, under the assumption that because that operation is necessary very often during traversal it must be important to performance. My hope is that a significant amount of time is spent on checking for these intersections, because it means that my work up until now has the potential to be more impactful.

Luckily, profiling results seemed to confirm this prediction: 348 of Perf's samples were during the Mesh refinement operation (most of the remaining were during data-loading). Of those, 155 were inside the ray-bbox intersection code (CGAL::Intersections::internal::do_intersect_bbox_segment_aux). This tells us that approximately 44.5% of the time performing the remeshing operation is spent just on checking for intersections between rays and bounding boxes, while traversing the tree!

4.7.4 Branchless do_intersect_bbox_segment_aux

The ray-bbox intersection reference implementation has been a good way of experimenting with different ideas and learning about SIMD techniques, but continuing to optimize it encounters diminishing returns. It's begun to make sense to experiment with applying similar techniques to real CGAL code. With that in mind, I evaluated making CGAL's own ray-bbox intersection code branchless. This is a good first step because it's closely related to the work I've done so far, and future optimizations like child-skipping can benefit from such a change.

I started by creating a new branch on my own fork of CGAL dedicated to the branchless intersection change, I'm planning to create branches dedicated to many tests like this one, so it doesn't make sense to add all of these branches to cgal-public-dev (instead I'll merge them into my branch there if they produce good results). Next, I annotated the relevant function to build familiarity. CGAL's intersection guarantees exactness, but doesn't lean on the inherent exactness of the kernel. Instead, it coerces the floating point type to double and does the intersection math in double precision, but uses a custom class to keep track of the possible accumulated error. It uses that class to determine the margin it needs to be certain of its result, and has a return type of Uncertain<bool>, so that it can return an indeterminate result if the values fall within the margin. The result is an intersection function significantly more elaborate than the ones I've been working with, totaling over 350 lines.

The existing function was not easily amenable to a branchless structure. I especially had trouble with the way it mixes conditional logic that's based on template parameters, and logic based on arguments. For example, in if(bounded_1 && qx > bxmax) return false;, bounded_1 is a template parameter, so conditional logic which depends on it can be evaluated as a constexpr (resulting in branchless assembly). qx > bxmax does not have the same advantage, so for instantiations bounded_1 is true this becomes an early exit. This type of structuring makes perfect sense in context, but doesn't map cleanly to a branchless alternative. More challenging, a lot of the logic later in the code depends on the guarantees provided by early exits. This means that eliminating branches has to be done holistically, rather than as a series of local transformations to the function's logic.

For a simpler test, I replaced the entire existing code with a simple version adapted from my previous branchless code.

template<
        typename FT,
        typename BFT,
        bool bounded_0,
        bool bounded_1,
        bool use_static_filters
>
inline typename Do_intersect_bbox_segment_aux_is_greater<
        FT,
        bounded_0,
        use_static_filters
>::result_type
do_intersect_bbox_segment_aux(
              const FT &px, const FT &py, const FT &pz,
              const FT &qx, const FT &qy, const FT &qz,
              const BFT &bxmin, const BFT &bymin, const BFT &bzmin,
              const BFT &bxmax, const BFT &bymax, const BFT &bzmax) {
 
        const BFT bx[] = {bxmin, bxmax};
        const BFT by[] = {bymin, bymax};
        const BFT bz[] = {bzmin, bzmax};
 
        const int sx = qx < 0;
        const int sy = qy < 0;
        const int sz = qz < 0;
 
        // Determine bounds x, y, and z
        FT xmin = (bx[sx] - px) / (qx - px);
        FT xmax = (bx[1 - sx] - px) / (qx - px);
        FT ymin = (by[sy] - py) / (qy - py);
        FT ymax = (by[1 - sy] - py) / (qy - py);
        FT zmin = (bz[sz] - pz) / (qz - pz);
        FT zmax = (bz[1 - sz] - pz) / (qz - pz);
 
        // Determine the bounds of the overlapping region
        FT min = std::max({xmin, ymin, zmin});
        FT max = std::min({xmax, ymax, zmax});
 
        // The ray intercepts if this region overlaps with the interval provided
        return (max >= min) &&            // Check if there is any overlap at all
               !(bounded_0 && max < 0) && // Check if the overlap is outside the ray (before the start)
               !(bounded_1 && min > 1);   // Check if the overlap is outside the ray (after the end)
}

I also added a simple parity check against the old method, when it's not uncertain. This gives me some confidence in this code's correctness, though the dataset tested probably doesn't hit every edge case.

This version uses the full precision types provided by the kernel, meaning that it will never return an uncertain value. This comes with a performance penalty (especially for the kernels with the strongest guarantees of exactness), but it massively simplifies the logic, and may improve performance in real tasks. This is because in the old code, uncertain results were assumed to be hits, resulting in looking at more nodes than strictly necessary.

Simple benchmarking was done using the time command on single runs of the compiled program. Using the old solution, I get performance that tends to look like:

time ~/Documents/cgal-public-dev/Mesh_3/examples/Mesh_3/cmake-build-release/remesh_polyhedral_surface_sm

real    0m0.461s
user    0m0.453s
sys     0m0.006s

Once the optimization is applied, the program is much faster:

time ~/Documents/cgal-public-dev/Mesh_3/examples/Mesh_3/cmake-build-release/remesh_polyhedral_surface_sm

real    0m0.181s
user    0m0.172s
sys     0m0.005s

The actual improvement is actually much larger than this might suggest, my profiler indicates that the number of samples during traversal went from 182 to just 18 (most of the time is spent constructing the tree). It's challenging to determine how much of this improvement comes from the simpler, faster intersection function, and how much comes from the reduction of spurious intersections. In any case, the complexity of the kernel's exact floating point type means that it's unlikely that this change actually meaningfully increased the use of SIMD.

4.7.5 Meeting with Adrien Cassagne

Adrien Cassagne is a researcher elsewhere at Inria with work that makes extensive use of SIMD. Most relevant is his paper on MIPP, a custom SIMD wrapper library which he wrote as part of another line of research. Pierre contacted Adrien, and was able to schedule a meeting for the 6th of July. In preparation for our meeting, I prepared a set of slides to explain the problem statement.

Adrien's experience turned out to be relevant to our work, and he made some salient points about our approach so far. He said that wrapper libraries seemed like a good level of abstraction for our needs, which is the same conclusion we came to. He was surprised that our "AoS" and "AoSoA" implementations weren't faster (especially given our high ratio of memory-heavy operations to numerically intense ones), and he offered to examine the test code to try and understand why it would underperform.

Adrien agreed that SIMD was a worthwhile endeavor for CGAL, and he expects to have extra free time until September, so he's interested in arranging more meetings to help.

He also made several suggestions about how to best vectorize the AABB-tree, though he noted that he's never vectorized a tree structure like this before. One that interested me was his mention of stream traversal, performing queries with groups of similar rays simultaneously. This is something that Embree does, so it caught my attention that he independently proposed the same optimization. Stream traversal would mesh well with CGAL's existing abstractions (it could be as simple as defining a new query type which bundles several ray queries), but the catch is that it works best when the rays being cast are very similar. In rendering this is as done by casting rays of adjacent pixels, but for CGAL it's difficult to think of a relevant use-case. Do any of CGAL's packages have a tendency to perform queries of similar rays?

4.7.6 Branchless BBox-BBox Intersection

CGAL's BBox-BBox intersection function is significantly simpler than Ray-BBox intersection. This is because bounding boxes are axis-aligned, so a pair of boxes intersect whenever their x, y, and z bounds all intersect. Because all of the comparisons are along axes there isn't any opportunity for accumulated error, this means all the math can be done using the double type without any additional logic needed to add margins.

As you'll see in the assembly, making the logic branchless isn't as simple as combining the logic into a single expression.

Branchless Conversion of BBox-BBox Intersection
ApproachCodeAssembly
Original
bool
do_overlap(const Bbox_3& bb1, const Bbox_3& bb2)
{
    // check for emptiness ??
    if (bb1.xmax() < bb2.xmin() || bb2.xmax() < bb1.xmin())
        return false;
    if (bb1.ymax() < bb2.ymin() || bb2.ymax() < bb1.ymin())
        return false;
    if (bb1.zmax() < bb2.zmin() || bb2.zmax() < bb1.zmin())
        return false;
    return true;
}
00000000004012b0 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)>:
  4012b0:           f2 0f 10 06                 movsd  (%rsi),%xmm0
  4012b4:           31 c0                       xor    %eax,%eax
  4012b6:           66 0f 2f 47 18              comisd 0x18(%rdi),%xmm0
  4012bb:       /-- 77 3c                       ja     4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49>
  4012bd:       |   f2 0f 10 07                 movsd  (%rdi),%xmm0
  4012c1:       |   66 0f 2f 46 18              comisd 0x18(%rsi),%xmm0
  4012c6:       +-- 77 31                       ja     4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49>
  4012c8:       |   f2 0f 10 46 08              movsd  0x8(%rsi),%xmm0
  4012cd:       |   66 0f 2f 47 20              comisd 0x20(%rdi),%xmm0
  4012d2:       +-- 77 25                       ja     4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49>
  4012d4:       |   f2 0f 10 47 08              movsd  0x8(%rdi),%xmm0
  4012d9:       |   66 0f 2f 46 20              comisd 0x20(%rsi),%xmm0
  4012de:       +-- 77 19                       ja     4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49>
  4012e0:       |   f2 0f 10 46 10              movsd  0x10(%rsi),%xmm0
  4012e5:       |   66 0f 2f 47 28              comisd 0x28(%rdi),%xmm0
  4012ea:       +-- 77 0d                       ja     4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49>
  4012ec:       |   f2 0f 10 47 10              movsd  0x10(%rdi),%xmm0
  4012f1:       |   66 0f 2f 46 28              comisd 0x28(%rsi),%xmm0
  4012f6:       |   0f 96 c0                    setbe  %al
  4012f9:       \-> c3                          retq

It's immediately clear the original code is riddled with early exits. These early exits may or may not be helpful for performance during sequential operation, but they pose a problem when we want to vectorize this code.

Single-expression
bool
do_overlap(const Bbox_3& bb1, const Bbox_3& bb2)
{
  // Only return true if there's overlap on every axis
  return !(bb1.xmax() < bb2.xmin() || bb2.xmax() < bb1.xmin()) && // No overlap on x
         !(bb1.ymax() < bb2.ymin() || bb2.ymax() < bb1.ymin()) && // No overlap on y
         !(bb1.zmax() < bb2.zmin() || bb2.zmax() < bb1.zmin());   // No overlap on z
}
00000000004012b0 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)>:
  4012b0:           f2 0f 10 06                 movsd  (%rsi),%xmm0
  4012b4:           31 c0                       xor    %eax,%eax
  4012b6:           66 0f 2f 47 18              comisd 0x18(%rdi),%xmm0
  4012bb:       /-- 77 3c                       ja     4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49>
  4012bd:       |   f2 0f 10 07                 movsd  (%rdi),%xmm0
  4012c1:       |   66 0f 2f 46 18              comisd 0x18(%rsi),%xmm0
  4012c6:       +-- 77 31                       ja     4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49>
  4012c8:       |   f2 0f 10 46 08              movsd  0x8(%rsi),%xmm0
  4012cd:       |   66 0f 2f 47 20              comisd 0x20(%rdi),%xmm0
  4012d2:       +-- 77 25                       ja     4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49>
  4012d4:       |   f2 0f 10 47 08              movsd  0x8(%rdi),%xmm0
  4012d9:       |   66 0f 2f 46 20              comisd 0x20(%rsi),%xmm0
  4012de:       +-- 77 19                       ja     4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49>
  4012e0:       |   f2 0f 10 46 10              movsd  0x10(%rsi),%xmm0
  4012e5:       |   66 0f 2f 47 28              comisd 0x28(%rdi),%xmm0
  4012ea:       +-- 77 0d                       ja     4012f9 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x49>
  4012ec:       |   f2 0f 10 47 10              movsd  0x10(%rdi),%xmm0
  4012f1:       |   66 0f 2f 46 28              comisd 0x28(%rsi),%xmm0
  4012f6:       |   0f 96 c0                    setbe  %al
  4012f9:       \-> c3                          retq

Refactoring into a single expression may seem like it should result in branchless operation -- after all, it's only producing a single boolean and returning it -- but for reasons I'll explain shortly it produces exactly the same assembly as the original version!

"Branchless"
bool
do_overlap(const Bbox_3& bb1, const Bbox_3& bb2)
{
  // Determine how much overlap exists in each axis
  double x_overlap = std::min(bb1.xmax(), bb2.xmax()) - std::max(bb1.xmin(), bb2.xmin());
  double y_overlap = std::min(bb1.ymax(), bb2.ymax()) - std::max(bb1.ymin(), bb2.ymin());
  double z_overlap = std::min(bb1.zmax(), bb2.zmax()) - std::max(bb1.zmin(), bb2.zmin());
 
  // Only return true if there's overlap on every axis
  return x_overlap >= 0 &&
         y_overlap >= 0 &&
         z_overlap >= 0;
}
00000000004012b0 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)>:
  4012b0:           f2 0f 10 46 18              movsd  0x18(%rsi),%xmm0
  4012b5:           f2 0f 10 0e                 movsd  (%rsi),%xmm1
  4012b9:           66 0f ef ed                 pxor   %xmm5,%xmm5
  4012bd:           f2 0f 5f 0f                 maxsd  (%rdi),%xmm1
  4012c1:           f2 0f 5d 47 18              minsd  0x18(%rdi),%xmm0
  4012c6:           f2 0f 10 5e 08              movsd  0x8(%rsi),%xmm3
  4012cb:           f2 0f 10 56 28              movsd  0x28(%rsi),%xmm2
  4012d0:           f2 0f 10 66 10              movsd  0x10(%rsi),%xmm4
  4012d5:           f2 0f 5f 5f 08              maxsd  0x8(%rdi),%xmm3
  4012da:           f2 0f 5c c1                 subsd  %xmm1,%xmm0
  4012de:           f2 0f 5d 57 28              minsd  0x28(%rdi),%xmm2
  4012e3:           f2 0f 10 4e 20              movsd  0x20(%rsi),%xmm1
  4012e8:           f2 0f 5f 67 10              maxsd  0x10(%rdi),%xmm4
  4012ed:           f2 0f 5d 4f 20              minsd  0x20(%rdi),%xmm1
  4012f2:           66 0f 2f c5                 comisd %xmm5,%xmm0
  4012f6:       /-- 72 18                       jb     401310 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x60>
  4012f8:       |   f2 0f 5c cb                 subsd  %xmm3,%xmm1
  4012fc:       |   66 0f 2f cd                 comisd %xmm5,%xmm1
  401300:       +-- 72 0e                       jb     401310 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)+0x60>
  401302:       |   f2 0f 5c d4                 subsd  %xmm4,%xmm2
  401306:       |   66 0f 2f d5                 comisd %xmm5,%xmm2
  40130a:       |   0f 93 c0                    setae  %al
  40130d:       |   c3                          retq
  40130e:       |   66 90                       xchg   %ax,%ax
  401310:       \-> 31 c0                       xor    %eax,%eax
  401312:           c3                          retq

This version replaces the original approach with something that looks at the problem "intuitively". It does reduce the number of early exits, but it still doesn't actually produce branchless logic. Looking at the code, it's not entirely clear where our conditional behavior comes from.

The catch: C++'s short circuit evaluation introduces jumps! In our logical expression x_overlap >= 0 && y_overlap >= 0 && z_overlap >= 0, the C++ standard guarantees that y_overlap >= 0 && z_overlap >= 0 should not be evaluated if x_overlap >= 0 is false. To achieve this, the compiler must add an "invisible" conditional.

I've found short-circuiting behavior helpful only on occasion, and I hadn't considered the degree to which it hamstrings the compiler. Even though y_overlap >= 0 && z_overlap >= 0 clearly has no side-effects, the compiler must insert logic to avoid evaluating it, when it may be detrimental to performance. Luckily, C++ provides several ways of avoiding short-circuits in logic.

Explicit branchless

(intermediate values)

bool
do_overlap(const Bbox_3& bb1, const Bbox_3& bb2)
{
 
  // Determine how much overlap exists in each axis
  double x_overlap = std::min(bb1.xmax(), bb2.xmax()) - std::max(bb1.xmin(), bb2.xmin());
  double y_overlap = std::min(bb1.ymax(), bb2.ymax()) - std::max(bb1.ymin(), bb2.ymin());
  double z_overlap = std::min(bb1.zmax(), bb2.zmax()) - std::max(bb1.zmin(), bb2.zmin());
 
  // Determine whether each axis has a positive overlap
  bool x_intersect = x_overlap >= 0;
  bool y_intersect = y_overlap >= 0;
  bool z_intersect = z_overlap >= 0;
 
  // Only intersects if there's overlap on every axis
  bool xy_intersect = x_intersect && y_intersect;
  bool intersect = xy_intersect && z_intersect;
 
  return intersect;
}
00000000004012b0 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)>:
  4012b0:       f2 0f 10 56 18          movsd  0x18(%rsi),%xmm2
  4012b5:       f2 0f 10 0e             movsd  (%rsi),%xmm1
  4012b9:       f2 0f 5d 57 18          minsd  0x18(%rdi),%xmm2
  4012be:       f2 0f 5f 0f             maxsd  (%rdi),%xmm1
  4012c2:       f2 0f 10 5e 08          movsd  0x8(%rsi),%xmm3
  4012c7:       f2 0f 5f 5f 08          maxsd  0x8(%rdi),%xmm3
  4012cc:       f2 0f 10 46 28          movsd  0x28(%rsi),%xmm0
  4012d1:       f2 0f 10 66 10          movsd  0x10(%rsi),%xmm4
  4012d6:       f2 0f 5c d1             subsd  %xmm1,%xmm2
  4012da:       f2 0f 10 4e 20          movsd  0x20(%rsi),%xmm1
  4012df:       f2 0f 5d 4f 20          minsd  0x20(%rdi),%xmm1
  4012e4:       f2 0f 5d 47 28          minsd  0x28(%rdi),%xmm0
  4012e9:       f2 0f 5f 67 10          maxsd  0x10(%rdi),%xmm4
  4012ee:       f2 0f 5c cb             subsd  %xmm3,%xmm1
  4012f2:       66 0f ef db             pxor   %xmm3,%xmm3
  4012f6:       66 0f 2f d3             comisd %xmm3,%xmm2
  4012fa:       f2 0f 5c c4             subsd  %xmm4,%xmm0
  4012fe:       0f 93 c0                setae  %al
  401301:       66 0f 2f cb             comisd %xmm3,%xmm1
  401305:       0f 93 c2                setae  %dl
  401308:       21 d0                   and    %edx,%eax
  40130a:       66 0f 2f c3             comisd %xmm3,%xmm0
  40130e:       0f 93 c2                setae  %dl
  401311:       21 d0                   and    %edx,%eax
  401313:       c3                      retq

Here, I use additional variables to ensure that intermediate values are actually computed. This is the first version to produce true straight-line assembly!

This tends to be the best-practice for this type of thing, in this case the resulting code is terribly unclear. Even if this isn't how the assembly does things, it's preferable to treat x, y, and z symmetrically.

Explicit branchless

(boolean operations)

bool
do_overlap(const Bbox_3& bb1, const Bbox_3& bb2)
{
  // Determine how much overlap exists in each axis
  double x_overlap = std::min(bb1.xmax(), bb2.xmax()) - std::max(bb1.xmin(), bb2.xmin());
  double y_overlap = std::min(bb1.ymax(), bb2.ymax()) - std::max(bb1.ymin(), bb2.ymin());
  double z_overlap = std::min(bb1.zmax(), bb2.zmax()) - std::max(bb1.zmin(), bb2.zmin());
 
  // Determine whether each axis has a positive overlap
  bool x_intersect = x_overlap >= 0;
  bool y_intersect = y_overlap >= 0;
  bool z_intersect = z_overlap >= 0;
 
  // Only intersects if there's overlap on every axis
  bool intersect = x_intersect & y_intersect & z_intersect;
 
  return intersect;
}
00000000004012b0 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)>:
  4012b0:       f2 0f 10 56 18          movsd  0x18(%rsi),%xmm2
  4012b5:       f2 0f 10 0e             movsd  (%rsi),%xmm1
  4012b9:       f2 0f 5d 57 18          minsd  0x18(%rdi),%xmm2
  4012be:       f2 0f 5f 0f             maxsd  (%rdi),%xmm1
  4012c2:       f2 0f 10 5e 08          movsd  0x8(%rsi),%xmm3
  4012c7:       f2 0f 5f 5f 08          maxsd  0x8(%rdi),%xmm3
  4012cc:       f2 0f 10 46 28          movsd  0x28(%rsi),%xmm0
  4012d1:       f2 0f 10 66 10          movsd  0x10(%rsi),%xmm4
  4012d6:       f2 0f 5c d1             subsd  %xmm1,%xmm2
  4012da:       f2 0f 10 4e 20          movsd  0x20(%rsi),%xmm1
  4012df:       f2 0f 5d 4f 20          minsd  0x20(%rdi),%xmm1
  4012e4:       f2 0f 5d 47 28          minsd  0x28(%rdi),%xmm0
  4012e9:       f2 0f 5f 67 10          maxsd  0x10(%rdi),%xmm4
  4012ee:       f2 0f 5c cb             subsd  %xmm3,%xmm1
  4012f2:       66 0f ef db             pxor   %xmm3,%xmm3
  4012f6:       66 0f 2f d3             comisd %xmm3,%xmm2
  4012fa:       f2 0f 5c c4             subsd  %xmm4,%xmm0
  4012fe:       0f 93 c0                setae  %al
  401301:       66 0f 2f cb             comisd %xmm3,%xmm1
  401305:       0f 93 c2                setae  %dl
  401308:       21 d0                   and    %edx,%eax
  40130a:       66 0f 2f c3             comisd %xmm3,%xmm0
  40130e:       0f 93 c2                setae  %dl
  401311:       21 d0                   and    %edx,%eax
  401313:       c3                      retq

Using bitwise operators produces identical assembly, but lets us preserve code structure. Both this solution and the previous solution suffer the same readability issue: someone who isn't familiar with the reason behind the "ugly" code is liable to clean it up, reintroducing the short circuits. This issue can be resolved by surrounding the code with explanatory comments. With that in mind I prefer this solution, because it's more easily readable at a glance.

Legible branchless
bool
do_overlap(const Bbox_3& bb1, const Bbox_3& bb2)
{
  // Only return true if there's overlap on every axis
  return !(bb1.xmax() < bb2.xmin() | bb2.xmax() < bb1.xmin()) & // No overlap on x
         !(bb1.ymax() < bb2.ymin() | bb2.ymax() < bb1.ymin()) & // No overlap on y
         !(bb1.zmax() < bb2.zmin() | bb2.zmax() < bb1.zmin());  // No overlap on z
}
00000000004012b0 <CGAL::do_overlap(CGAL::Bbox_3 const&, CGAL::Bbox_3 const&)>:
  4012b0:       f2 0f 10 06             movsd  (%rsi),%xmm0
  4012b4:       66 0f 2f 47 18          comisd 0x18(%rdi),%xmm0
  4012b9:       f2 0f 10 07             movsd  (%rdi),%xmm0
  4012bd:       0f 97 c0                seta   %al
  4012c0:       66 0f 2f 46 18          comisd 0x18(%rsi),%xmm0
  4012c5:       f2 0f 10 46 08          movsd  0x8(%rsi),%xmm0
  4012ca:       0f 97 c2                seta   %dl
  4012cd:       09 d0                   or     %edx,%eax
  4012cf:       66 0f 2f 47 20          comisd 0x20(%rdi),%xmm0
  4012d4:       f2 0f 10 47 08          movsd  0x8(%rdi),%xmm0
  4012d9:       0f 97 c2                seta   %dl
  4012dc:       09 d0                   or     %edx,%eax
  4012de:       66 0f 2f 46 20          comisd 0x20(%rsi),%xmm0
  4012e3:       f2 0f 10 46 10          movsd  0x10(%rsi),%xmm0
  4012e8:       0f 97 c2                seta   %dl
  4012eb:       09 d0                   or     %edx,%eax
  4012ed:       66 0f 2f 47 28          comisd 0x28(%rdi),%xmm0
  4012f2:       f2 0f 10 47 10          movsd  0x10(%rdi),%xmm0
  4012f7:       0f 97 c2                seta   %dl
  4012fa:       09 d0                   or     %edx,%eax
  4012fc:       66 0f 2f 46 28          comisd 0x28(%rsi),%xmm0
  401301:       0f 97 c2                seta   %dl
  401304:       09 d0                   or     %edx,%eax
  401306:       83 f0 01                xor    $0x1,%eax
  401309:       c3                      retq

Now that we know a solution to the root issue, we can apply it to our first approach. This code contains none of the jumps that approach had, despite appearing almost identical!

This process illustrates the unique challenges that come with making complex logic branchless. Problems are often well-hidden, and solutions unintuitive. This is related to one of the advantages that a library like xsimd presents: it's explicitness means that we can have greater confidence in the structure of the produced assembly, without needing to examine it directly.

4.8 July 9 - July 16

4.8.1 Truly Branchless do_intersect_bbox_segment_aux

Earlier I said that the FT type used by the new ray-bbox intersection function was likely too elaborate to be effectively vectorized. Further investigation revealed that in the case of the program I'm using as a benchmark, FT is defined as double! This can't be assumed to be true for most Kernels, but in this particular case it provides an opportunity to look more closely at how the function gets optimized.

Branchless Conversion of Ray-BBox Intersection
ApproachCodeAssembly
With Short-circuits
template<
        typename FT,
        typename BFT,
        bool bounded_0,
        bool bounded_1,
        bool use_static_filters
>
typename Do_intersect_bbox_segment_aux_is_greater<
        FT,
        bounded_0,
        use_static_filters
>::result_type
do_intersect_bbox_segment_aux(
        const FT &px, const FT &py, const FT &pz,
        const FT &qx, const FT &qy, const FT &qz,
        const BFT &bxmin, const BFT &bymin, const BFT &bzmin,
        const BFT &bxmax, const BFT &bymax, const BFT &bzmax) {
 
    const BFT bx[] = {bxmin, bxmax};
    const BFT by[] = {bymin, bymax};
    const BFT bz[] = {bzmin, bzmax};
 
    const int sx = qx < 0;
    const int sy = qy < 0;
    const int sz = qz < 0;
 
    // Determine bounds x, y, and z
    FT xmin = (bx[sx] - px) / (qx - px);
    FT xmax = (bx[1 - sx] - px) / (qx - px);
    FT ymin = (by[sy] - py) / (qy - py);
    FT ymax = (by[1 - sy] - py) / (qy - py);
    FT zmin = (bz[sz] - pz) / (qz - pz);
    FT zmax = (bz[1 - sz] - pz) / (qz - pz);
 
    // Determine the bounds of the overlapping region
    FT min = std::max({xmin, ymin, zmin});
    FT max = std::min({xmax, ymax, zmax});
 
    // The ray intercepts if this region overlaps with the interval provided
    Uncertain<bool> new_result =
            (max >= min) &&            // Check if there is any overlap at all
            !(bounded_0 && max < 0) && // Check if the overlap is outside the ray (before the start)
            !(bounded_1 && min > 1);   // Check if the overlap is outside the ray (after the end)
 
    return new_result;
}
000000000040c480 <CCGAL::Intersections::internal::do_intersect_bbox_segment_aux<...>(...)>:
  40c480:           f2 0f 11 7c 24 e0           movsd  %xmm7,-0x20(%rsp)
  40c486:           66 0f ef ff                 pxor   %xmm7,%xmm7
  40c48a:           31 c0                       xor    %eax,%eax
  40c48c:           66 44 0f 28 c0              movapd %xmm0,%xmm8
  40c491:           66 0f 2f fb                 comisd %xmm3,%xmm7
  40c495:           f2 0f 10 44 24 10           movsd  0x10(%rsp),%xmm0
  40c49b:           f2 0f 11 74 24 d0           movsd  %xmm6,-0x30(%rsp)
  40c4a1:           66 0f 28 f3                 movapd %xmm3,%xmm6
  40c4a5:           f2 41 0f 5c f0              subsd  %xmm8,%xmm6
  40c4aa:           f2 0f 11 44 24 d8           movsd  %xmm0,-0x28(%rsp)
  40c4b0:           f2 0f 10 44 24 18           movsd  0x18(%rsp),%xmm0
  40c4b6:           0f 97 c0                    seta   %al
  40c4b9:           31 f6                       xor    %esi,%esi
  40c4bb:           66 0f 2f fc                 comisd %xmm4,%xmm7
  40c4bf:           f2 0f 11 44 24 e8           movsd  %xmm0,-0x18(%rsp)
  40c4c5:           48 89 c7                    mov    %rax,%rdi
  40c4c8:           f2 0f 5c e1                 subsd  %xmm1,%xmm4
  40c4cc:           f2 0f 10 44 24 08           movsd  0x8(%rsp),%xmm0
  40c4d2:           f2 0f 10 5c c4 d0           movsd  -0x30(%rsp,%rax,8),%xmm3
  40c4d8:           b8 01 00 00 00              mov    $0x1,%eax
  40c4dd:           89 c2                       mov    %eax,%edx
  40c4df:           40 0f 97 c6                 seta   %sil
  40c4e3:           31 c9                       xor    %ecx,%ecx
  40c4e5:           f2 0f 11 44 24 f0           movsd  %xmm0,-0x10(%rsp)
  40c4eb:           f2 0f 10 44 24 20           movsd  0x20(%rsp),%xmm0
  40c4f1:           66 0f 2f fd                 comisd %xmm5,%xmm7
  40c4f5:           f2 41 0f 5c d8              subsd  %xmm8,%xmm3
  40c4fa:           f2 0f 5c ea                 subsd  %xmm2,%xmm5
  40c4fe:           f2 0f 11 44 24 f8           movsd  %xmm0,-0x8(%rsp)
  40c504:           f2 0f 5e de                 divsd  %xmm6,%xmm3
  40c508:           0f 97 c1                    seta   %cl
  40c50b:           29 fa                       sub    %edi,%edx
  40c50d:           48 63 d2                    movslq %edx,%rdx
  40c510:           f2 0f 10 44 d4 d0           movsd  -0x30(%rsp,%rdx,8),%xmm0
  40c516:           48 63 d6                    movslq %esi,%rdx
  40c519:           f2 41 0f 5c c0              subsd  %xmm8,%xmm0
  40c51e:           f2 44 0f 10 44 d4 e0        movsd  -0x20(%rsp,%rdx,8),%xmm8
  40c525:           89 c2                       mov    %eax,%edx
  40c527:           29 c8                       sub    %ecx,%eax
  40c529:           29 f2                       sub    %esi,%edx
  40c52b:           48 98                       cltq
  40c52d:           48 63 d2                    movslq %edx,%rdx
  40c530:           f2 44 0f 5c c1              subsd  %xmm1,%xmm8
  40c535:           f2 0f 5e c6                 divsd  %xmm6,%xmm0
  40c539:           f2 0f 10 74 d4 e0           movsd  -0x20(%rsp,%rdx,8),%xmm6
  40c53f:           48 63 d1                    movslq %ecx,%rdx
  40c542:           f2 0f 5c f1                 subsd  %xmm1,%xmm6
  40c546:           f2 0f 10 4c c4 f0           movsd  -0x10(%rsp,%rax,8),%xmm1
  40c54c:           f2 0f 5c ca                 subsd  %xmm2,%xmm1
  40c550:           f2 44 0f 5e c4              divsd  %xmm4,%xmm8
  40c555:           f2 0f 5e f4                 divsd  %xmm4,%xmm6
  40c559:           f2 0f 10 64 d4 f0           movsd  -0x10(%rsp,%rdx,8),%xmm4
  40c55f:           31 d2                       xor    %edx,%edx
  40c561:           f2 0f 5c e2                 subsd  %xmm2,%xmm4
  40c565:           f2 44 0f 5f c3              maxsd  %xmm3,%xmm8
  40c56a:           f2 0f 5e e5                 divsd  %xmm5,%xmm4
  40c56e:           f2 0f 5d f0                 minsd  %xmm0,%xmm6
  40c572:           f2 0f 5e cd                 divsd  %xmm5,%xmm1
  40c576:           f2 41 0f 5f e0              maxsd  %xmm8,%xmm4
  40c57b:           f2 0f 5d ce                 minsd  %xmm6,%xmm1
  40c57f:           66 0f 2f cc                 comisd %xmm4,%xmm1
  40c583:       /-- 72 07                       jb     40c58c <CGAL::Intersections::internal::do_intersect_bbox_segment_aux<...>(...)+0x10c>
  40c585:       |   66 0f 2f f9                 comisd %xmm1,%xmm7
  40c589:       |   0f 96 c2                    setbe  %dl
  40c58c:       \-> 31 c0                       xor    %eax,%eax
  40c58e:           88 d0                       mov    %dl,%al
  40c590:           88 d4                       mov    %dl,%ah
  40c592:           c3                          retq

My previous work produced something that was nearly branchless, but clearly it wasn't quite perfect! The short-circuiting behavior causes the same problems it did in my last experience, but we can fix it with the same solution.

Short-circuits removed
template<
        typename FT,
        typename BFT,
        bool bounded_0,
        bool bounded_1,
        bool use_static_filters
>
typename Do_intersect_bbox_segment_aux_is_greater<
        FT,
        bounded_0,
        use_static_filters
>::result_type
do_intersect_bbox_segment_aux(
        const FT &px, const FT &py, const FT &pz,
        const FT &qx, const FT &qy, const FT &qz,
        const BFT &bxmin, const BFT &bymin, const BFT &bzmin,
        const BFT &bxmax, const BFT &bymax, const BFT &bzmax) {
 
    const BFT bx[] = {bxmin, bxmax};
    const BFT by[] = {bymin, bymax};
    const BFT bz[] = {bzmin, bzmax};
 
    const int sx = qx < 0;
    const int sy = qy < 0;
    const int sz = qz < 0;
 
    // Determine bounds x, y, and z
    FT xmin = (bx[sx] - px) / (qx - px);
    FT xmax = (bx[1 - sx] - px) / (qx - px);
    FT ymin = (by[sy] - py) / (qy - py);
    FT ymax = (by[1 - sy] - py) / (qy - py);
    FT zmin = (bz[sz] - pz) / (qz - pz);
    FT zmax = (bz[1 - sz] - pz) / (qz - pz);
 
    // Determine the bounds of the overlapping region
    FT min = std::max({xmin, ymin, zmin});
    FT max = std::min({xmax, ymax, zmax});
 
    // The ray intercepts if this region overlaps with the interval provided
    Uncertain<bool> new_result =
            (max >= min) &            // Check if there is any overlap at all
            !(bounded_0 && max < 0) & // Check if the overlap is outside the ray (before the start)
            !(bounded_1 && min > 1);  // Check if the overlap is outside the ray (after the end)
 
    return new_result;
}
000000000040c460 <CGAL::Intersections::internal::do_intersect_bbox_segment_aux<...>(...)>:
  40c460:       f2 0f 11 7c 24 e0       movsd  %xmm7,-0x20(%rsp)
  40c466:       66 0f ef ff             pxor   %xmm7,%xmm7
  40c46a:       31 c0                   xor    %eax,%eax
  40c46c:       66 44 0f 28 c0          movapd %xmm0,%xmm8
  40c471:       66 0f 2f fb             comisd %xmm3,%xmm7
  40c475:       f2 0f 10 44 24 10       movsd  0x10(%rsp),%xmm0
  40c47b:       f2 0f 11 74 24 d0       movsd  %xmm6,-0x30(%rsp)
  40c481:       66 0f 28 f3             movapd %xmm3,%xmm6
  40c485:       f2 41 0f 5c f0          subsd  %xmm8,%xmm6
  40c48a:       f2 0f 11 44 24 d8       movsd  %xmm0,-0x28(%rsp)
  40c490:       f2 0f 10 44 24 18       movsd  0x18(%rsp),%xmm0
  40c496:       0f 97 c0                seta   %al
  40c499:       31 f6                   xor    %esi,%esi
  40c49b:       66 0f 2f fc             comisd %xmm4,%xmm7
  40c49f:       f2 0f 11 44 24 e8       movsd  %xmm0,-0x18(%rsp)
  40c4a5:       48 89 c7                mov    %rax,%rdi
  40c4a8:       f2 0f 5c e1             subsd  %xmm1,%xmm4
  40c4ac:       f2 0f 10 44 24 08       movsd  0x8(%rsp),%xmm0
  40c4b2:       f2 0f 10 5c c4 d0       movsd  -0x30(%rsp,%rax,8),%xmm3
  40c4b8:       b8 01 00 00 00          mov    $0x1,%eax
  40c4bd:       89 c2                   mov    %eax,%edx
  40c4bf:       40 0f 97 c6             seta   %sil
  40c4c3:       31 c9                   xor    %ecx,%ecx
  40c4c5:       f2 0f 11 44 24 f0       movsd  %xmm0,-0x10(%rsp)
  40c4cb:       f2 0f 10 44 24 20       movsd  0x20(%rsp),%xmm0
  40c4d1:       66 0f 2f fd             comisd %xmm5,%xmm7
  40c4d5:       f2 41 0f 5c d8          subsd  %xmm8,%xmm3
  40c4da:       f2 0f 5c ea             subsd  %xmm2,%xmm5
  40c4de:       f2 0f 11 44 24 f8       movsd  %xmm0,-0x8(%rsp)
  40c4e4:       f2 0f 5e de             divsd  %xmm6,%xmm3
  40c4e8:       0f 97 c1                seta   %cl
  40c4eb:       29 fa                   sub    %edi,%edx
  40c4ed:       48 63 d2                movslq %edx,%rdx
  40c4f0:       f2 0f 10 44 d4 d0       movsd  -0x30(%rsp,%rdx,8),%xmm0
  40c4f6:       48 63 d6                movslq %esi,%rdx
  40c4f9:       f2 41 0f 5c c0          subsd  %xmm8,%xmm0
  40c4fe:       f2 44 0f 10 44 d4 e0    movsd  -0x20(%rsp,%rdx,8),%xmm8
  40c505:       89 c2                   mov    %eax,%edx
  40c507:       29 c8                   sub    %ecx,%eax
  40c509:       29 f2                   sub    %esi,%edx
  40c50b:       48 98                   cltq
  40c50d:       48 63 d2                movslq %edx,%rdx
  40c510:       f2 44 0f 5c c1          subsd  %xmm1,%xmm8
  40c515:       f2 0f 5e c6             divsd  %xmm6,%xmm0
  40c519:       f2 0f 10 74 d4 e0       movsd  -0x20(%rsp,%rdx,8),%xmm6
  40c51f:       48 63 d1                movslq %ecx,%rdx
  40c522:       f2 0f 5c f1             subsd  %xmm1,%xmm6
  40c526:       f2 0f 10 4c c4 f0       movsd  -0x10(%rsp,%rax,8),%xmm1
  40c52c:       f2 0f 5c ca             subsd  %xmm2,%xmm1
  40c530:       f2 0f 5e f4             divsd  %xmm4,%xmm6
  40c534:       f2 44 0f 5e c4          divsd  %xmm4,%xmm8
  40c539:       f2 0f 10 64 d4 f0       movsd  -0x10(%rsp,%rdx,8),%xmm4
  40c53f:       f2 0f 5c e2             subsd  %xmm2,%xmm4
  40c543:       f2 0f 5d f0             minsd  %xmm0,%xmm6
  40c547:       f2 0f 5e cd             divsd  %xmm5,%xmm1
  40c54b:       f2 44 0f 5f c3          maxsd  %xmm3,%xmm8
  40c550:       f2 0f 5e e5             divsd  %xmm5,%xmm4
  40c554:       f2 0f 5d ce             minsd  %xmm6,%xmm1
  40c558:       66 0f 2f f9             comisd %xmm1,%xmm7
  40c55c:       0f 96 c0                setbe  %al
  40c55f:       f2 41 0f 5f e0          maxsd  %xmm8,%xmm4
  40c564:       66 0f 2f cc             comisd %xmm4,%xmm1
  40c568:       0f 93 c2                setae  %dl
  40c56b:       21 d0                   and    %edx,%eax
  40c56d:       31 d2                   xor    %edx,%edx
  40c56f:       88 c2                   mov    %al,%dl
  40c571:       88 c6                   mov    %al,%dh
  40c573:       89 d0                   mov    %edx,%eax
  40c575:       c3                      retq

By using bitwise operators in place of normal boolean logic, we get true straight-line code.

I should emphasize that the fact that FT is a double is only incidental, and this can't be relied upon for general use. Neither of these solutions behave exactly the same as the original version either, and they're liable to be giving false precision by never returning uncertain results. Nevertheless, it's useful to have a branchless version of the intersection function, and this lets us evaluate the advantages of some approaches that can be combined with it.

4.8.2 Benchmarking the Child-Skipping Optimization

Last month, Andreas experimented with adding a child-skipping optimization to the aabb-tree's traversal code. By retrieving the grandchildren of each node, we can compare our bounding boxes in groups of 4 at a time. My intention is to examine how the Child-Skipping optimization combines with our branchless intersection code. Because the child skipping optimization uses the intersection function with groups of bounding boxes at a time, the hope is that when those are inlined they can be combined into a vectorized version of the same function.

To this end, I would like to be able to use the child-skipping approach in benchmarks, once this is done it becomes trivial to integrate it with the branchless intersection code (which is already working in the benchmark on another branch). Unfortunately because I'm using remesh_polyhedral_surface_sm as my benchmark, the child-skipping approach isn't immediately compatible with existing code.

I had hoped the conversion process would be as simple as enabling the 4-way traversal with a #define, as in Andreas' example code. The first issue I encountered was that there are actually a large number of intersection traits types provided by the AABB-tree, and I had to add the 4-way intersection function to each of them. That helped, but I then discovered that the program also used a traits class defined in the Polygon_mesh_processing package: Ray_3_Triangle_3_traversal_traits. Some traits classes have slightly different behavior, which meant that I couldn't simply factor the 4 way intersection out. Ray_3_Triangle_3_traversal_traits has some unique behavior of its own, but it also has a template specialization that uses a much more elaborate intersection function! This intersection function (which conditionally invokes the branchless one) is the source of a lot of trouble. It's challenging to produce a straight-line version of it, and it may also prevent the branchless code it invokes from being inlined the way I'd like.

These complications make child-skipping a much less appealing optimization; I originally thought that it could be done in a low-impact way, only affecting certain traversal functions, instead it requires invasive changes to the way traversal is done. We have no guarantee that this change would actually be a net improvement to performance, let alone a quantitative prediction for its value. As a result, I'm considering shelving this change and dedicating time to examining more straightforward solutions instead.

4.8.3 Preparing to implement N-Way Splitting

4.8.3.1 Planning

This list of steps is by no means complete, it primarily focuses on the member variables the Node type is composed of. Each of the changes has implications for the rest of the code, for example putting my pointers in a small array will require changes to the relevant accessors. The idea is that changes to how the data is structured will be the driver for the rest of the work. I expect each change to cause minor issues throughout the code-base, which I can fix before moving on to the next step. By breaking things up in this way I keep the code close to a working state at every point during the development. I expect the largest challenges to come from other packages in CGAL with a dependence on particular implementation details of the existing tree; with luck I'll be able to change them to better respect the AABB-tree's abstractions.

Steps:

  • Convert children from a pair of independent pointers to an array of pointers with size 2.
    void *left, *right;std::array<void *, 2> children;
  • Use std::variant to differentiate between child pointers and primitive pointers (rather than using a void *)
    std::array<void *, 2> children;std::array<std::variant<Node *, Primitive *>, 2> children;
  • Replace array of pointers with pointer to array, to pack the relevant data closer together.
    std::array<std::variant<Node *, Primitive *>, 2> children;std::array<std::variant<Node, Primitive *>, 2> *children;
  • Replace uses of left and right children with operations applied to the list.
    func(n.left()); func(n.right());for (const auto &c : n.children()) func(c);
  • Parametrize size of children list using a hard-coded value or a #define.
    std::array<std::variant<Node *, Primitive *>, 2> children;std::array<std::variant<Node *, Primitive *>, N> children;
  • Replace hard-coded child count with a template parameter for the node, having a default value of 2.
  • Add child count template parameter to tree's Traits class, using that value when instantiating the node class (again, with a default value of 2).
  • Change default value to 4 or 8.

Additionally, once these changes are made it becomes simpler to pack the bounding boxes efficiently:

  • Move BBox into parent node.
  • Break BBoxes into efficiently packed parallel arrays.

4.8.3.2 Child Array

The first planned change was to replace the independent children with an std::array containing them. This change went very quickly, and because the Node class is nicely encapsulated the difference could be accounted for simply by altering the accessors. I'm using the aabb_any_all_benchmark.cpp test to verify the program, and it actually showed a marginal but consistent improvement in performance! This improvement is around 1% for all kernels used, and may come from std::array's semantics, (for example, perhaps both children can be initialized to nullptr in a more efficient way).

4.8.3.3 std::variant Children

Because CGAL doesn't yet depend on C++17, I substituted boost::variant for std::variant, this went almost as smoothly as switching to an array, besides some extra effort to initialize the node with nullptrs of the correct type. variant provides some additional safety guarantees, but that comes with a performance penalty; the same program runs a little over 1% slower than the original.

4.8.3.4 Pointer-to-array replaces Array-of-pointers

Though not strictly necessary, it seemed prudent to replace my array of pointers of nodes with a pointer to an array of nodes. This could theoretically produce simpler syntax, but more importantly it guarantees that nodes with the same parent will be adjacent in memory.

The first complication I ran into here was that the nodes are already being stored in an underlying vector. Whenever we allocate new nodes, they get emplaced at the back of the vector. This does mean that nodes hold references to other places in the vector, which would be invalidated if the vector were to reallocate. To avoid this, the vector has enough space reserved beforehand to avoid the need to reallocate for even the largest (degenerate case) tree. To have pointers to arrays, I actually need to have pointers to subsets of the underlying node vector. CGAL isn't on C++20 (yet), so I can't use std::span; instead, I'm using a bare pointer to the group's first element as a stopgap solution.

The next issue was that I needed to change the contents of the underlying vector. Because nodes were allowed to contain a combination of primitives and other nodes, the vector needed to be able to do the same. As an temporary solution, this is done with the help of boost::variant. In the next step, I'm planning to change the structure of the tree, so that each node can contain either a pair of nodes or a primitive. This moves the use of variant into the Node type, reduces pointer chaining, and opens the door to experimenting with bucket sizes in the future.

I expected this to perform worse than previous methods, primarily because the underlying node vector is much more poorly packed. My expectations were borne out, as the new code performs nearly 10% worse.

4.8.3.5 Single-primitive leaves

I was expecting this change to be more challenging, because it's a fundamental difference in how the tree is constructed.

In the original tree structure, a node could point to a pair of child nodes (not shown), or a child node and a primitive (the root node), or a pair of primitives (the first node on the right).

          N
         / \
        P   N
           / \
          P   P

With the new way I'm defining the node type, it can either point to a pair of child nodes (which are contiguous in memory) or a single primitive.

    N
     \
      {N, N}
       |   \
       P    {N, N}
             |  |
             P  P

The implementation actually turned out simple (each node contains an std::variant determining whether it has children or a primitive), and the Node class is well encapsulated, so I could hide the change with shim code in the left_data and right_data methods. I'll need to remove explicit references to "left" and "right" when I increase the number of children, so these methods will be temporary.

I expected this to hurt performance, but if it does that's counteracted by the node's better arrangement in memory. This performs significantly better than the last approach, only a couple of percent worse than the original.

4.8.3.6 Simplified Traversal Method

This is where the previous changes start to pay off. A node now contains either a single primitive, or the set of its children (for now, always a pair). This is a reduction in the number of valid states a node can be in, and actually simplifies the traversal function. Not only is this an improvement for readability, it actually results in better performance, too. The traversal benchmarks run between 5 and 15% faster than the original version, depending on the kernel!

4.8.3.7 Simplified Ray-Traversal Method

The ray_intersection() method in AABB_ray_intersection.h behaves a little bit differently than the tree's built in traversal function. That method can be used to find out if there are any intersections with the query, or to collect all intersecting primitives, depending on the provided traversal traits. This method specifically finds the first intersected primitive (closest to the source), naturally, this is a concept specific to the ray type. The most clearly relevant use here is rendering, where it's obvious that only the closest intersection is needed.

To achieve this functionality efficiently, a very different approach is used. A priority queue is used to look at the nearest nodes first, and the closest intersection is updated whenever a closer one is found. The traversal doesn't stop until the next nearest node is further than current closest intersection, because none of its children could possibly be closer.

Similar to the other traversal method, the algorithm is complicated by the number of states the node is allowed to be in. Now that we have a less complicated node type, we can simplify this method significantly. Here, the simplifications actually hurt performance by around 10%. My theory is that before primitives would immediately be checked for intersection when they were encountered, but now they are being added to the queue, since they have their own nodes with bounding boxes. There are workarounds to prevent this from happening, which I plan to explore once other optimizations have been applied.

4.8.3.8 Eliminating use of "left" and "right"

The left/right data accessors actually fell out of use when I simplified the traversal functions. Because they didn't reflect the structure of the tree anymore, they didn't have any reason to appear in the new version of the functions.

The left/right child accessors were a more complicated situation. These make perfect sense in the context of the tree's current structure, but if I want to have more than two accessors than it's obvious that they can't be used. Instead, wherever an operation was performed on the left and then the right children, I substituted an iterator over the children. This tended to complicate things only minimally, the main problems came from where left and right nodes aren't treated uniformly. When distributing primitives between a pair of nodes the total count was divided in half to find the count for the left node, and the right node received any that were left over. This is a nice solution for a pair of nodes because it doesn't require special considerations for rounding, but it won't work for larger numbers of nodes. Instead, I created a function that decides how many primitives each child should get. What's convenient is that this function could distribute the primitives however I'd like, it just needs to be consistent because it will be used for both construction and traversal. Naturally, the best solution is to make the distribution as even as possible.

This change did hurt performance, and I suspect that these newly introduced loops aren't being unrolled. I also suspect that the code for determining how many primitives a node should contain is performing poorly. Overall, benchmarks run around 50% slower.

4.9 July 17 - July 24

4.9.1 Enabling N-Way Splitting

All of my preparatory work paid off, and getting a tree with N-way splits was accomplished by simply replacing all instances of a hard-coded "2" with a variable N. By changing N, I successfully produced trees of different widths.

Changing N to 4 results in a tree that is traversed marginally slower (a couple of percent); this makes sense, because generally it's now necessary to intersect with more bounding boxes as we descend the tree. This doesn't become an advantage until we begin using SIMD to perform those intersections at minimal additional cost.

I also discovered that the earlier change which massively hurt performance was actually due to omitting the traits.go_further() check that stops the search early. Reintroducing this check actually greatly improved performance, and the current code is actually only around 12% slower than the original.

A clear future step is to define a solid API for the modified tree (for example, N should be a template parameter), but serious time shouldn't be spent on that until after we can show a performance advantage.

4.9.2 Cleaning up Traversal code

The following is my solution for performing a traversal on a N-way tree. The intention of splitting the recursive into two for loops is to enable SIMD between the different invocations of the non-recursive loop. If we calculated each box intersection just before deciding to recurse, there would be no potential for them to combined into a batched operation.

  AABB_node<Tr>::traversal(const Query &query,
                           Traversal_traits &traits,
                           const std::size_t nb_primitives) const {
    if (!traits.go_further()) return;
 
    // This is a Depth-first traversal
 
    if (nb_primitives == 1) {
 
      // If there's only one primitive, we've reached a leaf node
      traits.intersection(query, data());
 
    } else {
 
      // Determine whether each child intersects
      std::array<bool, N> intersections{false};
      for (size_t i = 0; i < N; ++i)
        intersections[i] = traits.do_intersect(query, children()[i]);
 
      // Recursively check each child that intersected
      for (size_t i = 0; i < N; ++i) {
        if (intersections[i])
          children()[i].traversal(query, traits, num_primitives(i, nb_primitives));
      }
    }

This is where it becomes clear that we're reaching the limits of optimization that can be done simply by analyzing assembly. When this is compiled along with the branchless intersection method from earlier, the result is a tangle of jumps which is a challenge to interpret.

0000000000435830 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const>:
  435830:	      /-------------------------------------------------------> 41 56                	push   %r14
  435832:	      |                                                         41 55                	push   %r13
  435834:	      |                                                         49 89 d5             	mov    %rdx,%r13
  435837:	      |                                                         41 54                	push   %r12
  435839:	      |                                                         49 89 fc             	mov    %rdi,%r12
  43583c:	      |                                                         55                   	push   %rbp
  43583d:	      |                                                         48 89 f5             	mov    %rsi,%rbp
  435840:	      |                                                         53                   	push   %rbx
  435841:	      |                                                         48 89 cb             	mov    %rcx,%rbx
  435844:	      |                                                         48 83 ec 50          	sub    $0x50,%rsp
  435848:	      |  /----------------------------------------------------> 41 80 7d 00 00       	cmpb   $0x0,0x0(%r13)
  43584d:	/-----|--|----------------------------------------------------- 0f 85 54 02 00 00    	jne    435aa7 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x277>
  435853:	|     |  |                                                      49 8b 7c 24 30       	mov    0x30(%r12),%rdi
  435858:	|     |  |                                                      48 83 fb 01          	cmp    $0x1,%rbx
  43585c:	|  /--|--|----------------------------------------------------- 0f 84 d6 03 00 00    	je     435c38 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x408>
  435862:	|  |  |  |                                                      f3 0f 10 6d 14       	movss  0x14(%rbp),%xmm5
  435867:	|  |  |  |                                                      f3 0f 10 7d 0c       	movss  0xc(%rbp),%xmm7
  43586c:	|  |  |  |                                                      66 0f ef c0          	pxor   %xmm0,%xmm0
  435870:	|  |  |  |                                                      48 89 f8             	mov    %rdi,%rax
  435873:	|  |  |  |                                                      f3 44 0f 10 4d 10    	movss  0x10(%rbp),%xmm9
  435879:	|  |  |  |                                                      48 8d 54 24 1c       	lea    0x1c(%rsp),%rdx
  43587e:	|  |  |  |                                                      4c 8d 44 24 20       	lea    0x20(%rsp),%r8
  435883:	|  |  |  |                                                      c7 44 24 1c 00 00 00 	movl   $0x0,0x1c(%rsp)
  43588a:	|  |  |  |                                                      00
  43588b:	|  |  |  |                                                      44 0f 28 d5          	movaps %xmm5,%xmm10
  43588f:	|  |  |  |                                                      0f 28 f7             	movaps %xmm7,%xmm6
  435892:	|  |  |  |                                                      f3 0f 11 6c 24 0c    	movss  %xmm5,0xc(%rsp)
  435898:	|  |  |  |                                                      0f 28 cf             	movaps %xmm7,%xmm1
  43589b:	|  |  |  |                                                      f3 44 0f 59 d0       	mulss  %xmm0,%xmm10
  4358a0:	|  |  |  |                                                      45 0f 28 c1          	movaps %xmm9,%xmm8
  4358a4:	|  |  |  |                                                      41 0f 28 d9          	movaps %xmm9,%xmm3
  4358a8:	|  |  |  |                                                      f3 44 0f 59 c0       	mulss  %xmm0,%xmm8
  4358ad:	|  |  |  |                                                      f3 44 0f 58 55 08    	addss  0x8(%rbp),%xmm10
  4358b3:	|  |  |  |                                                      0f 57 0d f6 4f 04 00 	xorps  0x44ff6(%rip),%xmm1        # 47a8b0 <vtable for CGAL::Polyhedron_scan_OFF<CGAL::HalfedgeDS_default<CGAL::Epick, CGAL::I_Polyhedron_derived_items_3<CGAL::Polyhedron_items_3>, std::allocator<int> > >+0x308>
  4358ba:	|  |  |  |                                                      f3 0f 59 f0          	mulss  %xmm0,%xmm6
  4358be:	|  |  |  |                                                      0f 57 2d eb 4f 04 00 	xorps  0x44feb(%rip),%xmm5        # 47a8b0 <vtable for CGAL::Polyhedron_scan_OFF<CGAL::HalfedgeDS_default<CGAL::Epick, CGAL::I_Polyhedron_derived_items_3<CGAL::Polyhedron_items_3>, std::allocator<int> > >+0x308>
  4358c5:	|  |  |  |                                                      0f 57 1d e4 4f 04 00 	xorps  0x44fe4(%rip),%xmm3        # 47a8b0 <vtable for CGAL::Polyhedron_scan_OFF<CGAL::HalfedgeDS_default<CGAL::Epick, CGAL::I_Polyhedron_derived_items_3<CGAL::Polyhedron_items_3>, std::allocator<int> > >+0x308>
  4358cc:	|  |  |  |                                                      f3 44 0f 58 45 04    	addss  0x4(%rbp),%xmm8
  4358d2:	|  |  |  |                                                      f3 0f 58 75 00       	addss  0x0(%rbp),%xmm6
  4358d7:	|  |  |  |                                                      66 0f 7e c9          	movd   %xmm1,%ecx
  4358db:	|  |  |  |                                                      f3 44 0f 11 54 24 08 	movss  %xmm10,0x8(%rsp)
  4358e2:	|  |  |  |                                                      66 41 0f 7e eb       	movd   %xmm5,%r11d
  4358e7:	|  |  |  |                                                      66 41 0f 7e da       	movd   %xmm3,%r10d
  4358ec:	|  |  |  |                                               /----> 0f 2f f8             	comiss %xmm0,%xmm7
  4358ef:	|  |  |  |                                               |      66 0f ef c9          	pxor   %xmm1,%xmm1
  4358f3:	|  |  |  |                                               |      66 0f ef db          	pxor   %xmm3,%xmm3
  4358f7:	|  |  |  |                                               |      f2 0f 10 50 20       	movsd  0x20(%rax),%xmm2
  4358fc:	|  |  |  |                                               |      f2 44 0f 10 58 28    	movsd  0x28(%rax),%xmm11
  435902:	|  |  |  |                                               |      f2 44 0f 10 50 10    	movsd  0x10(%rax),%xmm10
  435908:	|  |  |  |                                               |      f2 0f 5a 48 18       	cvtsd2ss 0x18(%rax),%xmm1
  43590d:	|  |  |  |                                               |      f2 0f 5a 18          	cvtsd2ss (%rax),%xmm3
  435911:	|  |  |  |                                               |      f2 0f 10 60 08       	movsd  0x8(%rax),%xmm4
  435916:	|  |  |  |                 /-----------------------------|----- 0f 82 9c 01 00 00    	jb     435ab8 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x288>
  43591c:	|  |  |  |                 |                             |      0f 28 eb             	movaps %xmm3,%xmm5
  43591f:	|  |  |  |                 |                             |      f3 0f 5c ce          	subss  %xmm6,%xmm1
  435923:	|  |  |  |                 |                             |      0f 28 df             	movaps %xmm7,%xmm3
  435926:	|  |  |  |                 |                             |      f3 0f 5c ee          	subss  %xmm6,%xmm5
  43592a:	|  |  |  |                 |  /--------------------------|----> 0f 2e d8             	ucomiss %xmm0,%xmm3
  43592d:	|  |  |  |                 |  |                          |  /-- 7a 17                	jp     435946 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x116>
  43592f:	|  |  |  |                 |  |                          |  +-- 75 15                	jne    435946 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x116>
  435931:	|  |  |  |                 |  |                          |  |   45 31 f6             	xor    %r14d,%r14d
  435934:	|  |  |  |                 |  |                          |  |   0f 2f e8             	comiss %xmm0,%xmm5
  435937:	|  |  |  |        /--------|--|--------------------------|--|-- 0f 87 23 01 00 00    	ja     435a60 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x230>
  43593d:	|  |  |  |        |        |  |                          |  |   0f 2f c1             	comiss %xmm1,%xmm0
  435940:	|  |  |  |        +--------|--|--------------------------|--|-- 0f 87 1a 01 00 00    	ja     435a60 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x230>
  435946:	|  |  |  |        |        |  |                          |  \-> 44 0f 2f c8          	comiss %xmm0,%xmm9
  43594a:	|  |  |  |        |        |  |                          |      f2 0f 5a d2          	cvtsd2ss %xmm2,%xmm2
  43594e:	|  |  |  |        |        |  |                          |      f2 0f 5a e4          	cvtsd2ss %xmm4,%xmm4
  435952:	|  |  |  |        |  /-----|--|--------------------------|----- 0f 82 d8 01 00 00    	jb     435b30 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x300>
  435958:	|  |  |  |        |  |     |  |                          |      f3 41 0f 5c e0       	subss  %xmm8,%xmm4
  43595d:	|  |  |  |        |  |     |  |                          |      f3 41 0f 5c d0       	subss  %xmm8,%xmm2
  435962:	|  |  |  |        |  |     |  |                          |      45 0f 28 e1          	movaps %xmm9,%xmm12
  435966:	|  |  |  |        |  |     |  |                          |      44 0f 28 fc          	movaps %xmm4,%xmm15
  43596a:	|  |  |  |        |  |  /--|--|--------------------------|----> 44 0f 2e e0          	ucomiss %xmm0,%xmm12
  43596e:	|  |  |  |        |  |  |  |  |              /-----------|----- 0f 8a 5c 01 00 00    	jp     435ad0 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x2a0>
  435974:	|  |  |  |        |  |  |  |  |              +-----------|----- 0f 85 56 01 00 00    	jne    435ad0 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x2a0>
  43597a:	|  |  |  |        |  |  |  |  |              |           |      45 31 f6             	xor    %r14d,%r14d
  43597d:	|  |  |  |        |  |  |  |  |              |           |      44 0f 2f f8          	comiss %xmm0,%xmm15
  435981:	|  |  |  |        +--|--|--|--|--------------|-----------|----- 0f 87 d9 00 00 00    	ja     435a60 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x230>
  435987:	|  |  |  |        |  |  |  |  |              |           |      0f 2f c2             	comiss %xmm2,%xmm0
  43598a:	|  |  |  |        +--|--|--|--|--------------|-----------|----- 0f 87 d0 00 00 00    	ja     435a60 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x230>
  435990:	|  |  |  |        |  |  |  |  |              |           |      44 0f 28 f3          	movaps %xmm3,%xmm14
  435994:	|  |  |  |        |  |  |  |  |              |           |      44 0f 28 ed          	movaps %xmm5,%xmm13
  435998:	|  |  |  |        |  |  |  |  |              |           |      0f 28 e3             	movaps %xmm3,%xmm4
  43599b:	|  |  |  |        |  |  |  |  |              |           |      f3 45 0f 59 f7       	mulss  %xmm15,%xmm14
  4359a0:	|  |  |  |        |  |  |  |  |              |           |      f3 45 0f 59 ec       	mulss  %xmm12,%xmm13
  4359a5:	|  |  |  |        |  |  |  |  |              |           |      f3 0f 59 e2          	mulss  %xmm2,%xmm4
  4359a9:	|  |  |  |        |  |  |  |  |              |           |      66 44 0f 7e f6       	movd   %xmm14,%esi
  4359ae:	|  |  |  |        |  |  |  |  |              |           |      44 0f 28 f1          	movaps %xmm1,%xmm14
  4359b2:	|  |  |  |        |  |  |  |  |              |           |      f3 45 0f 59 f4       	mulss  %xmm12,%xmm14
  4359b7:	|  |  |  |        |  |  |  |  |              |           |      f3 0f 11 24 24       	movss  %xmm4,(%rsp)
  4359bc:	|  |  |  |        |  |  |  |  |              |           |      f3 44 0f 11 74 24 04 	movss  %xmm14,0x4(%rsp)
  4359c3:	|  |  |  |        |  |  |  |  |              |           |      66 44 0f 6e f6       	movd   %esi,%xmm14
  4359c8:	|  |  |  |        |  |  |  |  |              |           |      45 0f 2f f5          	comiss %xmm13,%xmm14
  4359cc:	|  |  |  |        |  |  |  |  |  /-----------|-----------|----- 0f 87 7e 01 00 00    	ja     435b50 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x320>
  4359d2:	|  |  |  |        |  |  |  |  |  |           |     /-----|----> 44 0f 28 eb          	movaps %xmm3,%xmm13
  4359d6:	|  |  |  |        |  |  |  |  |  |  /--------|-----|-----|----> f3 44 0f 10 7c 24 04 	movss  0x4(%rsp),%xmm15
  4359dd:	|  |  |  |        |  |  |  |  |  |  |        |     |     |      f3 0f 10 24 24       	movss  (%rsp),%xmm4
  4359e2:	|  |  |  |        |  |  |  |  |  |  |        |     |     |      f3 41 0f c2 e7 05    	cmpnltss %xmm15,%xmm4
  4359e8:	|  |  |  |        |  |  |  |  |  |  |        |     |     |      0f 54 dc             	andps  %xmm4,%xmm3
  4359eb:	|  |  |  |        |  |  |  |  |  |  |        |     |     |      44 0f 28 f4          	movaps %xmm4,%xmm14
  4359ef:	|  |  |  |        |  |  |  |  |  |  |        |     |     |      0f 54 cc             	andps  %xmm4,%xmm1
  4359f2:	|  |  |  |        |  |  |  |  |  |  |        |     |     |      0f 55 e2             	andnps %xmm2,%xmm4
  4359f5:	|  |  |  |        |  |  |  |  |  |  |        |     |     |      45 0f 55 f4          	andnps %xmm12,%xmm14
  4359f9:	|  |  |  |        |  |  |  |  |  |  |        |     |     |      66 0f ef d2          	pxor   %xmm2,%xmm2
  4359fd:	|  |  |  |        |  |  |  |  |  |  |        |     |     |      0f 56 cc             	orps   %xmm4,%xmm1
  435a00:	|  |  |  |        |  |  |  |  |  |  |        |     |     |      66 0f ef e4          	pxor   %xmm4,%xmm4
  435a04:	|  |  |  |        |  |  |  |  |  |  |        |     |     |      41 0f 56 de          	orps   %xmm14,%xmm3
  435a08:	|  |  |  |        |  |  |  |  |  |  |        |     |     |      f2 41 0f 5a d2       	cvtsd2ss %xmm10,%xmm2
  435a0d:	|  |  |  |        |  |  |  |  |  |  |        |     |     |      f2 41 0f 5a e3       	cvtsd2ss %xmm11,%xmm4
  435a12:	|  |  |  |        |  |  |  |  |  |  |        |     |     |      f3 44 0f 10 5c 24 0c 	movss  0xc(%rsp),%xmm11
  435a19:	|  |  |  |        |  |  |  |  |  |  |        |     |     |      44 0f 2f d8          	comiss %xmm0,%xmm11
  435a1d:	|  |  |  |  /-----|--|--|--|--|--|--|--------|-----|-----|----- 0f 82 ed 01 00 00    	jb     435c10 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x3e0>
  435a23:	|  |  |  |  |     |  |  |  |  |  |  |        |     |     |      f3 44 0f 10 54 24 08 	movss  0x8(%rsp),%xmm10
  435a2a:	|  |  |  |  |     |  |  |  |  |  |  |        |     |     |      f3 41 0f 5c d2       	subss  %xmm10,%xmm2
  435a2f:	|  |  |  |  |     |  |  |  |  |  |  |        |     |     |      f3 41 0f 5c e2       	subss  %xmm10,%xmm4
  435a34:	|  |  |  |  |     |  |  |  |  |  |  |        |     |     |      44 0f 28 e2          	movaps %xmm2,%xmm12
  435a38:	|  |  |  |  |     |  |  |  |  |  |  |        |     |     |      41 0f 28 d3          	movaps %xmm11,%xmm2
  435a3c:	|  |  |  |  |  /--|--|--|--|--|--|--|--------|-----|-----|----> f3 41 0f 59 e5       	mulss  %xmm13,%xmm4
  435a41:	|  |  |  |  |  |  |  |  |  |  |  |  |        |     |     |      45 31 f6             	xor    %r14d,%r14d
  435a44:	|  |  |  |  |  |  |  |  |  |  |  |  |        |     |     |      f3 0f 59 ea          	mulss  %xmm2,%xmm5
  435a48:	|  |  |  |  |  |  |  |  |  |  |  |  |        |     |     |      0f 2f e5             	comiss %xmm5,%xmm4
  435a4b:	|  |  |  |  |  |  +--|--|--|--|--|--|--------|-----|-----|----- 72 13                	jb     435a60 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x230>
  435a4d:	|  |  |  |  |  |  |  |  |  |  |  |  |        |     |     |      f3 41 0f 59 dc       	mulss  %xmm12,%xmm3
  435a52:	|  |  |  |  |  |  |  |  |  |  |  |  |        |     |     |      f3 0f 59 ca          	mulss  %xmm2,%xmm1
  435a56:	|  |  |  |  |  |  |  |  |  |  |  |  |        |     |     |      0f 2f cb             	comiss %xmm3,%xmm1
  435a59:	|  |  |  |  |  |  |  |  |  |  |  |  |        |     |     |      41 0f 93 c6          	setae  %r14b
  435a5d:	|  |  |  |  |  |  |  |  |  |  |  |  |        |     |     |      0f 1f 00             	nopl   (%rax)
  435a60:	|  |  |  |  |  |  >--|--|--|--|--|--|--------|-----|-----|----> 44 88 32             	mov    %r14b,(%rdx)
  435a63:	|  |  |  |  |  |  |  |  |  |  |  |  |        |     |     |      48 83 c2 01          	add    $0x1,%rdx
  435a67:	|  |  |  |  |  |  |  |  |  |  |  |  |        |     |     |      48 83 c0 38          	add    $0x38,%rax
  435a6b:	|  |  |  |  |  |  |  |  |  |  |  |  |        |     |     |      49 39 d0             	cmp    %rdx,%r8
  435a6e:	|  |  |  |  |  |  |  |  |  |  |  |  |        |     |     \----- 0f 85 78 fe ff ff    	jne    4358ec <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0xbc>
  435a74:	|  |  |  |  |  |  |  |  |  |  |  |  |        |     |            49 89 de             	mov    %rbx,%r14
  435a77:	|  |  |  |  |  |  |  |  |  |  |  |  |        |     |            49 c1 ee 02          	shr    $0x2,%r14
  435a7b:	|  |  |  |  |  |  |  |  |  |  |  |  |        |     |            80 7c 24 1c 00       	cmpb   $0x0,0x1c(%rsp)
  435a80:	|  |  |  |  |  |  |  |  |  |  |  |  |        |  /--|----------- 0f 85 5a 01 00 00    	jne    435be0 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x3b0>
  435a86:	|  |  |  |  |  |  |  |  |  |  |  |  |        |  |  |            80 7c 24 1d 00       	cmpb   $0x0,0x1d(%rsp)
  435a8b:	|  |  |  |  |  |  |  |  |  |  |  |  |  /-----|--|--|----------- 0f 85 1f 01 00 00    	jne    435bb0 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x380>
  435a91:	|  |  |  |  |  |  |  |  |  |  |  |  |  |  /--|--|--|----------> 80 7c 24 1e 00       	cmpb   $0x0,0x1e(%rsp)
  435a96:	|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  /-------- 0f 85 dc 00 00 00    	jne    435b78 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x348>
  435a9c:	|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  /----> 80 7c 24 1f 00       	cmpb   $0x0,0x1f(%rsp)
  435aa1:	|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  /-- 0f 85 b9 00 00 00    	jne    435b60 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x330>
  435aa7:	>--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|--|-> 48 83 c4 50          	add    $0x50,%rsp
  435aab:	|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |   5b                   	pop    %rbx
  435aac:	|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |   5d                   	pop    %rbp
  435aad:	|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |   41 5c                	pop    %r12
  435aaf:	|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |   41 5d                	pop    %r13
  435ab1:	|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |   41 5e                	pop    %r14
  435ab3:	|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |   c3                   	retq
  435ab4:	|  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |  |   0f 1f 40 00          	nopl   0x0(%rax)
  435ab8:	|  |  |  |  |  |  |  |  |  \--|--|--|--|--|--|--|--|--|--|--|-> 0f 28 ee             	movaps %xmm6,%xmm5
  435abb:	|  |  |  |  |  |  |  |  |     |  |  |  |  |  |  |  |  |  |  |   f3 0f 5c e9          	subss  %xmm1,%xmm5
  435abf:	|  |  |  |  |  |  |  |  |     |  |  |  |  |  |  |  |  |  |  |   0f 28 ce             	movaps %xmm6,%xmm1
  435ac2:	|  |  |  |  |  |  |  |  |     |  |  |  |  |  |  |  |  |  |  |   f3 0f 5c cb          	subss  %xmm3,%xmm1
  435ac6:	|  |  |  |  |  |  |  |  |     |  |  |  |  |  |  |  |  |  |  |   66 0f 6e d9          	movd   %ecx,%xmm3
  435aca:	|  |  |  |  |  |  |  |  |     \--|--|--|--|--|--|--|--|--|--|-- e9 5b fe ff ff       	jmpq   43592a <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0xfa>
  435acf:	|  |  |  |  |  |  |  |  |        |  |  |  |  |  |  |  |  |  |   90                   	nop
  435ad0:	|  |  |  |  |  |  |  |  |        |  |  |  |  \--|--|--|--|--|-> 0f 28 e3             	movaps %xmm3,%xmm4
  435ad3:	|  |  |  |  |  |  |  |  |        |  |  |  |     |  |  |  |  |   44 0f 28 ed          	movaps %xmm5,%xmm13
  435ad7:	|  |  |  |  |  |  |  |  |        |  |  |  |     |  |  |  |  |   45 31 f6             	xor    %r14d,%r14d
  435ada:	|  |  |  |  |  |  |  |  |        |  |  |  |     |  |  |  |  |   f3 0f 59 e2          	mulss  %xmm2,%xmm4
  435ade:	|  |  |  |  |  |  |  |  |        |  |  |  |     |  |  |  |  |   f3 45 0f 59 ec       	mulss  %xmm12,%xmm13
  435ae3:	|  |  |  |  |  |  |  |  |        |  |  |  |     |  |  |  |  |   f3 0f 11 24 24       	movss  %xmm4,(%rsp)
  435ae8:	|  |  |  |  |  |  |  |  |        |  |  |  |     |  |  |  |  |   44 0f 2f ec          	comiss %xmm4,%xmm13
  435aec:	|  |  |  |  |  |  +--|--|--------|--|--|--|-----|--|--|--|--|-- 0f 87 6e ff ff ff    	ja     435a60 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x230>
  435af2:	|  |  |  |  |  |  |  |  |        |  |  |  |     |  |  |  |  |   44 0f 28 f3          	movaps %xmm3,%xmm14
  435af6:	|  |  |  |  |  |  |  |  |        |  |  |  |     |  |  |  |  |   f3 45 0f 59 f7       	mulss  %xmm15,%xmm14
  435afb:	|  |  |  |  |  |  |  |  |        |  |  |  |     |  |  |  |  |   66 44 0f 7e f6       	movd   %xmm14,%esi
  435b00:	|  |  |  |  |  |  |  |  |        |  |  |  |     |  |  |  |  |   44 0f 28 f1          	movaps %xmm1,%xmm14
  435b04:	|  |  |  |  |  |  |  |  |        |  |  |  |     |  |  |  |  |   f3 45 0f 59 f4       	mulss  %xmm12,%xmm14
  435b09:	|  |  |  |  |  |  |  |  |        |  |  |  |     |  |  |  |  |   66 0f 6e e6          	movd   %esi,%xmm4
  435b0d:	|  |  |  |  |  |  |  |  |        |  |  |  |     |  |  |  |  |   41 0f 2f e6          	comiss %xmm14,%xmm4
  435b11:	|  |  |  |  |  |  |  |  |        |  |  |  |     |  |  |  |  |   f3 44 0f 11 74 24 04 	movss  %xmm14,0x4(%rsp)
  435b18:	|  |  |  |  |  |  \--|--|--------|--|--|--|-----|--|--|--|--|-- 0f 87 42 ff ff ff    	ja     435a60 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x230>
  435b1e:	|  |  |  |  |  |     |  |        |  |  |  |     |  |  |  |  |   66 44 0f 6e f6       	movd   %esi,%xmm14
  435b23:	|  |  |  |  |  |     |  |        |  |  |  |     |  |  |  |  |   45 0f 2f f5          	comiss %xmm13,%xmm14
  435b27:	|  |  |  |  |  |     |  |        |  |  |  |     |  \--|--|--|-- 0f 86 a5 fe ff ff    	jbe    4359d2 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x1a2>
  435b2d:	|  |  |  |  |  |     |  |        +--|--|--|-----|-----|--|--|-- eb 21                	jmp    435b50 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x320>
  435b2f:	|  |  |  |  |  |     |  |        |  |  |  |     |     |  |  |   90                   	nop
  435b30:	|  |  |  |  |  |     \--|--------|--|--|--|-----|-----|--|--|-> 45 0f 28 f8          	movaps %xmm8,%xmm15
  435b34:	|  |  |  |  |  |        |        |  |  |  |     |     |  |  |   66 45 0f 6e e2       	movd   %r10d,%xmm12
  435b39:	|  |  |  |  |  |        |        |  |  |  |     |     |  |  |   f3 44 0f 5c fa       	subss  %xmm2,%xmm15
  435b3e:	|  |  |  |  |  |        |        |  |  |  |     |     |  |  |   41 0f 28 d0          	movaps %xmm8,%xmm2
  435b42:	|  |  |  |  |  |        |        |  |  |  |     |     |  |  |   f3 0f 5c d4          	subss  %xmm4,%xmm2
  435b46:	|  |  |  |  |  |        \--------|--|--|--|-----|-----|--|--|-- e9 1f fe ff ff       	jmpq   43596a <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x13a>
  435b4b:	|  |  |  |  |  |                 |  |  |  |     |     |  |  |   0f 1f 44 00 00       	nopl   0x0(%rax,%rax,1)
  435b50:	|  |  |  |  |  |                 \--|--|--|-----|-----|--|--|-> 41 0f 28 ef          	movaps %xmm15,%xmm5
  435b54:	|  |  |  |  |  |                    |  |  |     |     |  |  |   45 0f 28 ec          	movaps %xmm12,%xmm13
  435b58:	|  |  |  |  |  |                    \--|--|-----|-----|--|--|-- e9 79 fe ff ff       	jmpq   4359d6 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x1a6>
  435b5d:	|  |  |  |  |  |                       |  |     |     |  |  |   0f 1f 00             	nopl   (%rax)
  435b60:	|  |  |  |  |  |                       |  |     |     |  |  >-> 4d 8b 64 24 30       	mov    0x30(%r12),%r12
  435b65:	|  |  |  |  |  |                       |  |     |     |  |  |   4c 89 f3             	mov    %r14,%rbx
  435b68:	|  |  |  |  |  |                       |  |     |     |  |  |   49 81 c4 a8 00 00 00 	add    $0xa8,%r12
  435b6f:	|  |  |  \--|--|-----------------------|--|-----|-----|--|--|-- e9 d4 fc ff ff       	jmpq   435848 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x18>
  435b74:	|  |  |     |  |                       |  |     |     |  |  |   0f 1f 40 00          	nopl   0x0(%rax)
  435b78:	|  |  |     |  |                       |  |     |     >--|--|-> 83 e3 03             	and    $0x3,%ebx
  435b7b:	|  |  |     |  |                       |  |     |     |  |  |   49 8b 44 24 30       	mov    0x30(%r12),%rax
  435b80:	|  |  |     |  |                       |  |     |     |  |  |   31 c9                	xor    %ecx,%ecx
  435b82:	|  |  |     |  |                       |  |     |     |  |  |   4c 89 ea             	mov    %r13,%rdx
  435b85:	|  |  |     |  |                       |  |     |     |  |  |   48 83 fb 02          	cmp    $0x2,%rbx
  435b89:	|  |  |     |  |                       |  |     |     |  |  |   48 89 ee             	mov    %rbp,%rsi
  435b8c:	|  |  |     |  |                       |  |     |     |  |  |   0f 97 c1             	seta   %cl
  435b8f:	|  |  |     |  |                       |  |     |     |  |  |   48 8d 78 70          	lea    0x70(%rax),%rdi
  435b93:	|  |  |     |  |                       |  |     |     |  |  |   4c 01 f1             	add    %r14,%rcx
  435b96:	|  |  +-----|--|-----------------------|--|-----|-----|--|--|-- e8 95 fc ff ff       	callq  435830 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const>
  435b9b:	|  |  |     |  |                       |  |     |     |  |  |   80 7c 24 1f 00       	cmpb   $0x0,0x1f(%rsp)
  435ba0:	+--|--|-----|--|-----------------------|--|-----|-----|--|--|-- 0f 84 01 ff ff ff    	je     435aa7 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x277>
  435ba6:	|  |  |     |  |                       |  |     |     |  |  \-- eb b8                	jmp    435b60 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x330>
  435ba8:	|  |  |     |  |                       |  |     |     |  |      0f 1f 84 00 00 00 00 	nopl   0x0(%rax,%rax,1)
  435baf:	|  |  |     |  |                       |  |     |     |  |      00
  435bb0:	|  |  |     |  |                       >--|-----|-----|--|----> 48 89 d9             	mov    %rbx,%rcx
  435bb3:	|  |  |     |  |                       |  |     |     |  |      49 8b 44 24 30       	mov    0x30(%r12),%rax
  435bb8:	|  |  |     |  |                       |  |     |     |  |      4c 89 ea             	mov    %r13,%rdx
  435bbb:	|  |  |     |  |                       |  |     |     |  |      48 89 ee             	mov    %rbp,%rsi
  435bbe:	|  |  |     |  |                       |  |     |     |  |      48 d1 e9             	shr    %rcx
  435bc1:	|  |  |     |  |                       |  |     |     |  |      83 e1 01             	and    $0x1,%ecx
  435bc4:	|  |  |     |  |                       |  |     |     |  |      48 8d 78 38          	lea    0x38(%rax),%rdi
  435bc8:	|  |  |     |  |                       |  |     |     |  |      4c 01 f1             	add    %r14,%rcx
  435bcb:	|  |  +-----|--|-----------------------|--|-----|-----|--|----- e8 60 fc ff ff       	callq  435830 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const>
  435bd0:	|  |  |     |  |                       |  |     |     |  |      80 7c 24 1e 00       	cmpb   $0x0,0x1e(%rsp)
  435bd5:	|  |  |     |  |                       |  |     |     |  \----- 0f 84 c1 fe ff ff    	je     435a9c <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x26c>
  435bdb:	|  |  |     |  |                       |  |     |     \-------- eb 9b                	jmp    435b78 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x348>
  435bdd:	|  |  |     |  |                       |  |     |               0f 1f 00             	nopl   (%rax)
  435be0:	|  |  |     |  |                       |  |     \-------------> 48 89 d8             	mov    %rbx,%rax
  435be3:	|  |  |     |  |                       |  |                     4c 89 f1             	mov    %r14,%rcx
  435be6:	|  |  |     |  |                       |  |                     4c 89 ea             	mov    %r13,%rdx
  435be9:	|  |  |     |  |                       |  |                     48 89 ee             	mov    %rbp,%rsi
  435bec:	|  |  |     |  |                       |  |                     83 e0 03             	and    $0x3,%eax
  435bef:	|  |  |     |  |                       |  |                     48 83 f8 01          	cmp    $0x1,%rax
  435bf3:	|  |  |     |  |                       |  |                     48 83 d9 ff          	sbb    $0xffffffffffffffff,%rcx
  435bf7:	|  |  \-----|--|-----------------------|--|-------------------- e8 34 fc ff ff       	callq  435830 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const>
  435bfc:	|  |        |  |                       |  |                     80 7c 24 1d 00       	cmpb   $0x0,0x1d(%rsp)
  435c01:	|  |        |  |                       |  \-------------------- 0f 84 8a fe ff ff    	je     435a91 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x261>
  435c07:	|  |        |  |                       \----------------------- eb a7                	jmp    435bb0 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x380>
  435c09:	|  |        |  |                                                0f 1f 80 00 00 00 00 	nopl   0x0(%rax)
  435c10:	|  |        \--|----------------------------------------------> f3 44 0f 10 7c 24 08 	movss  0x8(%rsp),%xmm15
  435c17:	|  |           |                                                45 0f 28 e7          	movaps %xmm15,%xmm12
  435c1b:	|  |           |                                                f3 44 0f 5c e4       	subss  %xmm4,%xmm12
  435c20:	|  |           |                                                41 0f 28 e7          	movaps %xmm15,%xmm4
  435c24:	|  |           |                                                f3 0f 5c e2          	subss  %xmm2,%xmm4
  435c28:	|  |           |                                                66 41 0f 6e d3       	movd   %r11d,%xmm2
  435c2d:	|  |           \----------------------------------------------- e9 0a fe ff ff       	jmpq   435a3c <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x20c>
  435c32:	|  |                                                            66 0f 1f 44 00 00    	nopw   0x0(%rax,%rax,1)
  435c38:	|  \----------------------------------------------------------> 48 8b 07             	mov    (%rdi),%rax
  435c3b:	|                                                               48 8d 7c 24 20       	lea    0x20(%rsp),%rdi
  435c40:	|                                                               48 8b 00             	mov    (%rax),%rax
  435c43:	|                                                               48 8b 10             	mov    (%rax),%rdx
  435c46:	|                                                               48 8b 48 08          	mov    0x8(%rax),%rcx
  435c4a:	|                                                               48 8b 40 18          	mov    0x18(%rax),%rax
  435c4e:	|                                                               48 8b 49 18          	mov    0x18(%rcx),%rcx
  435c52:	|                                                               48 8b 52 18          	mov    0x18(%rdx),%rdx
  435c56:	|                                                               48 8b 70 08          	mov    0x8(%rax),%rsi
  435c5a:	|                                                               8b 40 10             	mov    0x10(%rax),%eax
  435c5d:	|                                                               48 89 74 24 20       	mov    %rsi,0x20(%rsp)
  435c62:	|                                                               48 89 ee             	mov    %rbp,%rsi
  435c65:	|                                                               89 44 24 28          	mov    %eax,0x28(%rsp)
  435c69:	|                                                               48 8b 41 08          	mov    0x8(%rcx),%rax
  435c6d:	|                                                               48 89 44 24 2c       	mov    %rax,0x2c(%rsp)
  435c72:	|                                                               8b 41 10             	mov    0x10(%rcx),%eax
  435c75:	|                                                               89 44 24 34          	mov    %eax,0x34(%rsp)
  435c79:	|                                                               48 8b 42 08          	mov    0x8(%rdx),%rax
  435c7d:	|                                                               48 89 44 24 38       	mov    %rax,0x38(%rsp)
  435c82:	|                                                               8b 42 10             	mov    0x10(%rdx),%eax
  435c85:	|                                                               89 44 24 40          	mov    %eax,0x40(%rsp)
  435c89:	|                                                               e8 a2 77 fd ff       	callq  40d430 <bool CGAL::Intersections::internal::do_intersect<CGAL::Simple_cartesian<float> >(CGAL::Simple_cartesian<float>::Triangle_3 const&, CGAL::Simple_cartesian<float>::Line_3 const&, CGAL::Simple_cartesian<float> const&) [clone .isra.0]>
  435c8e:	|                                                               84 c0                	test   %al,%al
  435c90:	+-------------------------------------------------------------- 0f 84 11 fe ff ff    	je     435aa7 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x277>
  435c96:	|                                                               41 c6 45 00 01       	movb   $0x1,0x0(%r13)
  435c9b:	\-------------------------------------------------------------- e9 07 fe ff ff       	jmpq   435aa7 <void CGAL::AABB_node<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default> >::traversal<CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >, CGAL::Line_3<CGAL::Simple_cartesian<float> > >(CGAL::Line_3<CGAL::Simple_cartesian<float> > const&, CGAL::internal::AABB_tree::Do_intersect_traits<CGAL::AABB_traits<CGAL::Simple_cartesian<float>, CGAL::AABB_face_graph_triangle_primitive<CGAL::Polyhedron_3<CGAL::Simple_cartesian<float>, CGAL::Polyhedron_items_3, CGAL::HalfedgeDS_default, std::allocator<int> >, CGAL::Default, CGAL::Boolean_tag<true>, CGAL::Boolean_tag<false> >, CGAL::Default>, CGAL::Line_3<CGAL::Simple_cartesian<float> > >&, unsigned long) const+0x277>

The issue here isn't the number of jumps though: even when the intersection code is branchless, the compiler misses the opportunity to unroll a loop and vectorize across the inlined functions!

4.9.3 Benchmarking Different types of Intersections

Pierre and I have been looking at the idea of putting a using a bounding box (or several) around the query to speed up traversal. This strategy could make a big difference, because CGAL's bbox-bbox intersection doesn't require any special care to preserve exactness; it's algorithmically simpler, and it is far more readily compatible with SIMD techniques. The hope is that we can get some performance advantage by not doing intersections directly with the query until it's absolutely necessary.

We can get an idea of how big the performance advantage could be by examining the time it takes to perform different types of intersections. A boxed-query approach would replace ray-bbox intersections with simpler but broader bbox-bbox intersections. This means we may intersect with more nodes, but each individual intersection computation will be cheaper. Pierre's intuition was that bbox-bbox intersections should be at least 10x faster than ray-bbox intersections, and perhaps 100x faster than intersections with primitives. To find real-life numbers for these, I created a simple benchmark of the different intersections. This benchmark tested intersections between 1,000 of the query type and 10,000 of the target type, repeated 100 times with interleaved runs as usual. The total test takes a couple of minutes to run, but I divided by the total number of intersections to get the time for just one.

Time required to check for intersections between different types
Intersection TypeTime per Intersection (s)
Bbox-Bbox2.5e-14
Ray-Bbox1.34778e-08
Ray-Triangle8.25178e-08

Rather than 10x faster than ray-bbox intersections, bbox-bbox intersections were more than 500,000x faster! The difference likely won't be quite as spectacular in a real context (in the tree the more expensive part is probably chasing pointers), but these results make me pretty optimistic that we can get a performance improvement!

4.9.4 Implementing a simple Boxed Query technique

My solution for simple boxing of queries has the following features:

  • A Boxed_query<Q> class, which contains a reference to the query object and a bounding box that encloses it.
  • A constructor for the Boxed query which takes a query type and an Iso_cuboid boundary. This facilitates infinite query types like rays, cutting them at the border so that they can have a finite bounding box. (There's also a simpler constructor for non-infinite types)
  • A custom CGAL::do_intersect function which tries to rule out the possibility of an intersection using the bounding box, only falling back to direct intersection with the query if it's necessary.

While I was developing this, I modified the benchmarking code to save the boolean results of the intersections to an output vector. This let me guarantee that the results were correct, but it also massively hurt performance.

Time required to check for intersections between different types
Intersection TypeTime per Intersection (s)
Bbox-Bbox3.54384e-09
Ray-Bbox2.71308e-08
Boxed-Ray-Bbox2.08792e-08

This is promising! The boxed method is nearly 10x slower than plain box-box intersection, which makes sense because its time is likely dominated by the cases where it needs to fall back to ray intersection. Nevertheless, it still beats plain ray-box intersection, while providing equivalent functionality. This test includes the overhead of constructing the boxed query for every ray being intersected; this is actually rather expensive (it uses a boolean operation with an Iso_cuboid); a more effective method of cutting the infinite query type could result in a further performance advantage.

Removing the validation step makes the advantages a little bit more clear, with this change the boxed-ray beats the conventional method by more than 25%!

Time required to check for intersections between different types
Intersection TypeTime per Intersection (s)
Bbox-Bbox3.6e-14
Ray-Bbox2.24505e-08
Boxed-Ray-Bbox1.80355e-08

Notice how bbox-bbox intersection was impacted the most by validation. This is likely in part because it's a small function that can be vectorized well by the compiler; adding the result to a vector after each invocation of the function almost certainly interferes with vectorization between function calls. Even without that factor, its time was dominated by the overhead of saving the result.

4.10 July 25 - August 2

4.10.1 Using Boxed Queries to Traverse the Tree

With the performance potential confirmed, it made sense to try to take advantage of boxed queries when traversing the tree.

4.10.1.1 Specialized Traversal Function

The core of this strategy is a new template specialization for the traversal function, so that we can treat boxed queries specially.

template<typename Tr>
template<class Traversal_traits, class Query>
void
AABB_node<Tr>::traversal(const Boxed_query<Query> &boxed_query,
                         Traversal_traits &traits,
                         const std::size_t nb_primitives) const;

For our first prototype, we wanted to traverse with a single bbox surrounding the query. This bbox is reconstructed for each sub-node, because it only needs to surround the region of the query that falls within that node's own box. If we tried to use the same node for the entire depth of the traversal, it would quickly become much larger than the boxes it's being tested against.

Our specialization of the traversal function builds the new box for each child before it recursively traverses them, it also performs an intersection directly using the contained query directly once it reaches a primitive.

4.10.1.2 Invalidating Traversal Traits Assumptions

When I tried to directly intersect the query and the primitive in the base case of the traversal, I ran into a small snag.

The AABB_traversal_traits classes each provide two functions: bool do_intersect(const Query& query, const Node& node) for intersections with each child when determining whether to traverse recursively, and void intersection(const Query& query, const Primitive& primitive) for direct intersections with primitives (the traits class is responsible for collecting the results). The Query type is a template of the traits class, and that comes with the implicit assumption that the type will be the same for both types of intersections provided. Since we want to intersect directly using Boxed_query's underlying query type when we reach a primitive, this assumption no longer holds. As a temporary solution, these two functions are independently templated, with names DoIntersectQuery and IntersectionQuery, respectively. This shouldn't have any performance penalty, but it does cut against the grain of the Traits system's original design; eventually I think it's important to find a cleaner solution.

4.10.2 Issues with Traversal Correctness

Even with careful checking of the algorithm, the boxed query traversal was producing different results than the original "naive" solution. It consistently detected fewer intersections (6,114 vs 6,708, for example). I was unable to eliminate this issue without also taking away any performance advantage of the approach. One solution was to use a pessimized version of the algorithm: rather than assuming if the box doesn't intersect than its contained primitive must not, we can assume that if the box does intersect, so must its contained query (and if the box doesn't, try the query directly just in case). This is obviously wasteful, resulting in examining many more nodes than should be necessary, but it does guarantee that all intersections are found, producing the correct result.

Tracking these errors revealed that the missed intersections stemmed from "false negatives" when using the bbox as a filter for intersections between queries and bboxes. Examining the problematic scenarios showed that malformed boxes were not properly surrounding the underlying queries. This points to an issue in the Boxed_query constructor.

Pierre and I had a discussion, and our conclusion was that the Intersection system (doing the boolean operation between query and bbox) wasn't producing a conservative result because the kernel was inexact. For near-tangential intersections (where the ray query barely skims the surface of the bbox) the errors produced can become very pronounced. As a temporary solution, I'm testing whether using an exact construction kernel (Epec) will prevent these issues. If I get correct results with the Epec kernel, that means that this problem can be solved by creating a more conservative approach for putting a bbox around the query.

4.10.2.1 Compatibility with Epeck

Getting the boxed query system working with Epeck was more challenging than expected. Introducing it as another test case to the any_all_benchmark caused a wave of template errors; which surprised me because I had expected it to have an identical interface to the other kernels. Fixing the issues helped me discover and eliminate an unnecessary special case leftover from the old code, and it ensured that all treatments of the boxed_query are done where I expect (before I had relied on a custom cast operator).

Once Epeck was working, I was able to confirm that it always produced correct results. This means that our issue must stem from the query boxing approach not being conservative, as we thought.

4.10.3 Interval-based Boxing

After some experimentation with template specializations for lines and rays, I found that the simplest approach for adapting the existing boxing system was to use CGAL's kernel converter type. This let me make very few changes to the existing code, and preserve compatibility with the same range of Query types.

I eventually found that using Interval_nt<false> instead of Interval_nt<true> made a massive performance difference.

typedef Simple_cartesian<Interval_nt<false>> Interval_kernel;
typedef typename Kernel_traits<Query>::Kernel Query_kernel;
typedef typename Type_mapper<Query, Query_kernel, Interval_kernel>::type Interval_query;
typedef Cartesian_converter<Query_kernel, Interval_kernel> Converter_K_to_Interval;
 
...
 
 
Converter_K_to_Interval K_to_Interval;
interval_query = K_to_Interval(query);
 
...
 
static Bbox_3 bbox_for_enclosed_segment(const Interval_query &query, const Bbox_3 &boundary) {
      Interval_nt<false>::Protector p;
 
      auto b = Iso_cuboid_3<Interval_kernel>(boundary);
      auto overlap = CGAL::intersection(query, b);
 
      // If there wasn't any overlap, return an empty bounding box
      if (!overlap)
        return Bbox_3{};
 
      // Otherwise, get the bounding box of the result
      return boost::apply_visitor([](auto &&boolean_intersection_result) -> Bbox_3 {
        return boolean_intersection_result.bbox();
      }, *overlap);
    }

This allowed for full correctness, but came with a massive performance penalty. Creating a conservative bbox is a very expensive process, and it eliminated the advantage that the boxing strategy had.

4.10.3.1 Performance Results

As usual, benchmarking is done with the aabb_any_all_benchmark. This program builds a tree from a triangle mesh that it loads from a file, and then generates a large number of segment, ray, and line queries which it casts into the tree.

Results were disappointing, with a decrease in performance for all kernels.

Performance of a Simple Query-boxing Strategy, using Interval-math to Ensure Correctness
KernelNaiveBoxed
Simple cartesian float kernel0.5713181.39495
Cartesian float kernel0.8963361.51499
Simple cartesian double kernel0.4979321.32457
Cartesian double kernel0.8341491.45415
Epic kernel1.11711.47604
Epec kernel1.448851.74818

Clearly it's expensive to create these bounding boxes with full correctness, so we'll need to employ a more elaborate strategy to reduce the time we spend on that problem. I think we have two sides to approach that problem from:

  • Reduce how expensive it is to create these bounding boxes (e.g. a simpler function to create conservative boxes using interval arithmetic)
  • Reduce how often we need to compute them (e.g. only put a new box around the query when the current box stops being useful for filtering, instead of every time we descend the tree)

Of these options, I believe that the second has more promise. Ultimately, creating a conservative bounding box can only be optimized by a certain amount, especially while retaining our genericity.

For future sections, statements on performance will focus on the Epic kernel results.

The Epic kernel is the most likely to see a useful improvement, because it has the most expensive ray-bbox intersections.

4.10.4 Tunable Parameters

During a pair programming session, Andreas and I explored several avenues for improving our performance. We identified a collection of parameters that could be tuned to improve our results.

In the following table, all variations not listed in (parentheses) have been built and benchmarked. Each is discussed in more detail below.

Tunable Parameters
Conditional Shrinking
  • Never shrink
  • Always shrink
  • Shrink only when box intersects with all children's boxes
  • (Shrink only when box is larger than children's boxes)
BBox Count1, 2, 4, (N)
Children per Tree-node1, 2, 4, 8, (N)
Trust in the Filter
  • Fall-back to direct query-bbox intersections if the box intersects
  • Never fall back to query-bbox intersections

Including the naive approach, we have a total of 73 different combinations of optimizations.

We didn't do a grid-search of the parameters, but as we implemented the different optimizations we tested a variety of combinations. Some of the parameters have independent affects on performance, but others depend strongly on what other optimization are active.

4.10.4.1 Conditional Shrinking

This parameter determines when we create a new bounding box for our boxed query. This is important because it effects two opposing aspects of our performance:

  • It can increase or decrease how often the bbox_for_enclosed_segment function is invoked, an operation we've determined to be very expensive.
  • It can decrease or increase how tightly the bbox encloses the underlying query, which affects how many nodes we end up visiting.

We applied three different approaches, but also discussed the possibility of others.

Never Shrink

Here, we build the Boxed_query at the start of the traversal and then never update it again. This has the advantage of rarely requiring building a new bbox, but it means that the bbox will dwarf the node's boxes as we get deeper in the tree. The result was that the enclosing box quickly becomes useless, and overall it detracted from performance.

Always Shrink

This is on the opposite end of the spectrum. For this strategy, we create a new (smaller) bbox each time we descend the tree into a child. This ensures that the box always fits the query as tightly as possible, but it means that bbox_for_enclosed_segment is used constantly. It does better than never shrinking, but not particularly well.

Shrink when all Children Intersect

This strategy falls between the other two. It uses a heuristic to determine whether the box needs to be made smaller. We discussed several heuristics, but the only one implemented depends on intersections with child nodes. If the current box intersected with all the child nodes, then it might be too big (it's large enough to touch them all). When this happens, we shrink the box and continue traversing the tree.

4.10.4.2 Multi-BBox

By using more than one box to wrap the primitive, we may be able to get an unintuitively large performance advantage. This is because of some nonlinearities in the way that costs and benefits scale as the number of bboxes scale:

  • Using two boxes to surround a segment instead of one reduces the total combined volume, since the boxes can be smaller. Importantly, the new boxes aren't just 1/2 the total volume -- they're 1/4 the size.
  • Creating two boxes instead of one is less than twice as expensive. Getting the start and end points requires the expensive boolean intersection operation, but to create two boxes we only need one more point in between, which can be found using a relatively inexpensive midpoint() operation.

1 Box

If we use only one BBox but shrink it for each child, it's already relatively effective at wrapping the query. Using CGAL's profiling utility we were able to determine that it eliminated around 25% of direct query-bbox intersection. This represents a sizeable reduction in the number of expensive intersections we need to do.

4 Boxes

Increasing the number of boxes began to run into diminishing returns. For four boxes, we saw a reduction of around 35%. This is good, but we were hoping that the theoretical 16x reduction in query box volume would make a bigger difference.

We're still interested in trying larger numbers of boxes. As the number gets higher, it may make a non-shrinking approach more viable.

4.10.4.3 Children per node

Increasing the number of children per node makes sense when there's an "economy of scale" for query-bbox intersections. We found that this wasn't the case for naive linear queries, and increasing the width of the tree generally decreased performance. It makes sense for Embree to use wider trees because their ray-bbox intersection code can be vectorized, allowing them to do the extra intersections simultaneously. This optimization may also be applicable to our situation, but for different reasons.

2 Children

This is the standard case, and it tended to perform the best for most configurations.

8 Children

When we use singly-boxed queries and always shrink them it tends to be worthwhile to use wider trees. In certain configurations, we even saw improvements in performance up to a 12-way tree. This makes sense, because there's a cost to shrink the bbox which is amortized over the number of times it's used for intersections. Just as important, a wider tree can be much shallower, which reduces the number of times we need to shrink the box.

4.10.4.4 Trust in the Filter

There is more than one way to use our boxed query's bounding box. We have two options, with their own benefits and drawbacks.

Fall back to direct intersections when uncertain

Our original solution was to use the box as a negative filter. This means that if the two boxes don't intersect, then it's safe to assume that the query and the node's box don't intersect, since boxes are drawn conservatively. If the two boxes do intersect, we would then do (expensive) query-bbox intersection. The code makes use of short-circuiting logic, and looks like this:

template<typename Query, typename Other>
inline
bool
might_intersect(const Boxed_query<Query> &q,
                const Other &other) {
  return CGAL::do_intersect(q.bbox, other);
}
 
template<typename Query, typename Other>
inline
bool
do_intersect(const Boxed_query<Query> &q,
             const Other &other) {
  return might_intersect(q, other) && do_intersect(q.query, other);
}

This has the advantage that the boxed query's do_intersect function is always guaranteed to produce the same result as directly intersecting with the query. This means that the traversal will visit identical nodes to the naive solution, which is useful for testing correctness. Not visiting additional nodes also has a theoretical performance advantage, though it's outweighed by the cost of falling back to a direct query-bbox intersection.

Always trust that bbox-bbox intersections are representative

I found it was better to assume all intersections between boxes correspond to real intersections with the query. Implementing this was done by making might_intersect and do_intersect equivalent.

template<typename Query, typename Other>
inline
bool
might_intersect(const Boxed_query<Query> &q,
                const Other &other) {
  return CGAL::do_intersect(q.bbox, other);
}
 
template<typename Query, typename Other>
inline
bool
do_intersect(const Boxed_query<Query> &q,
             const Other &other) {
  return might_intersect(q, other);
}

Using CGAL's profiling tools, we could determine that this is because bbox intersections are a very good predictor of real intersections. Over 95% of the time, when the boxed query's bbox intersects with the other node, so does the query itself!

When we approach the problem this way, we slightly increase the number of query-primitive intersections, that difference is vanishingly small in comparison to the number of query-bbox intersections we save.

4.10.4.5 Optimal Results

My first attempt had the following parameters:

  • Always shrink (every time we descend the tree)
  • 1 box wrapping the query
  • 2-way tree
  • Fall back to direct query-bbox intersections when uncertain

We tested a wide variety of combinations, and the best we found included the following parameters:

  • Heuristic shrinking (only when all children intersect)
  • 1 box wrapping the query
  • 2-way tree
  • Never fall back to direct query-bbox intersections

Choosing the right approaches made a significant difference to performance:

Performance of Query-boxing strategies
KernelNaiveOriginal BoxedOptimized Boxed
Simple cartesian float kernel0.5713181.394950.935523
Cartesian float kernel0.8963361.514991.0592
Simple cartesian double kernel0.4979321.324570.833513
Cartesian double kernel0.8341491.454151.054
Epic kernel1.11711.476041.0115
Epec kernel1.448851.748181.28246

Even so, this result isn't without caveats. The optimized box approach only beats the naive approach for particularly heavy kernels. This is because its value is in its reduction of heavy numerical calculations done by the kernel. Bbox-Bbox intersections are very cheap compared to most Epic kernel operations, but this advantage shrinks for something like the Simple cartesian float kernel.

4.10.5 Only Lines Benefit from this Optimization

On Andreas' suggestion, I substituted Ray queries for Lines in the benchmark, this had a rather unfortunate effect on our performance results.

Performance of Query-boxing strategies when Line Queries are disabled
KernelNaiveOptimized Boxed
Simple cartesian float kernel0.6138510.857605
Cartesian float kernel0.6841380.927448
Simple cartesian double kernel0.5309060.784656
Cartesian double kernel0.6262220.921197
Epic kernel0.8110940.897078
Epec kernel1.142181.13796

The boxing strategy's performance advantages evaporate, even for the complex kernels that it excelled on!

This is likely happening because the Ray and Segment types have cursory checks to see if their origin is inside the bbox. This is nearly as cheap as bbox-bbox intersection, and tests showed that it's enough to catch intersections a lot of the time (removing this check comes with a sizeable performance penalty) The Line type, on the other hand, is infinite and defined by a line equation. It has no such pre-check, and so it's more expensive on average. Boxing helps us add an equivalent to the pre-check for infinite types like the line, so this is where its performance benefit comes from.

This presents a problem, because the AABB is generally used for Ray and Segment queries, not lines. This makes perfect sense, since line queries are slower without a strategy like boxing. Our work here has identified an optimization path that helps only a small subset of use-cases.

4.11 August 3 - August 10

4.11.1 Feasibility of a Compact Tree Structure

Andreas and I have been discussing the possibility of accessing child nodes using pointer or index math instead of references. This would rely on the fact that the arrangement of the nodes in memory should be deterministic. It comes with several advantages:

  • More compact node type, which wouldn't have to store any pointers or references to its children.
  • Easy access to both parents and children of nodes, something that would previously have required storing an additional reference.
  • Better cache performance (in part because of the more compact representation, in part due to better locality)

I did some research into this type of method, and a relevant concept here is "succinct data structure". An even more relevant term for what we're trying to do is implicit data structure. This field of research has been around perhaps even longer than computers have, and the topic of compact trees has been very thoroughly explored. The best implementation guide I found is this, and it cites a paper from 1964. Wikipedia also has its own short description of the technique. One of the key requirements that the article lists for this type of structure is that the tree be complete, but the wikipedia page acknowledges that this isn't actually a strict requirement: if we're okay with having empty nodes (gaps in our underlying vector), we can actually have any structure we want! Luckily, even with no changes to implementation our current construction strategy already produces nearly-complete trees.

Of course, the advantages and simplicity of implementation lead to one question:

4.11.1.1 Why isn't Embree doing this?

Using an implicit structure has some drawbacks, most of which don't apply to CGAL at the moment, but would if we wanted to incorporate some of Embree's techniques.

  • Embree has a tree-building technique that's intended to produce an optimal structure, by criteria such as total bbox surface area. This is naturally more computationally expensive than normal construction, but the built tree can provide faster traversal on average. The tree that's built has no guarantees of being anywhere near complete, and if put into an index-based implementation it may result in very large blocks of empty memory.
  • Embree has strong support for dynamic trees, which can be efficiently rebuilt as the stored primitives move around. This is useful for Embree's main purpose, raytracing, because it makes it easy to perform efficient traversals of a scene with moving parts. An array-based tree is more complicated to rebuild, and may not support the type of partial-restructuring that embree uses here.

Using an index-based structure doesn't necessarily close the door to optimizations and features like these in the future, but it would make their implementation more challenging if CGAL wanted to incorporate them someday.

4.11.2 Potential API for an Implicit Tree

There exists one major difference that means conversion to a compact representation isn't straightforward: nodes don't know about their children without additional context about the rest of the tree. In the current tree, a node has a reference to its children which can be followed regardless of whether this node is the root or deep inside the tree. For an index-based tree, the offset used to find the children's index depends on the specific index of that node. The root is directly adjacent to its children, but nodes deeper in the tree will be much further from their children. As a result, to find the location of the next child we need to know both the location of the current child and that of the start of the tree.

My proposal to avoid making traversal much more complex is the addition of a AABB_node_handle type, which would encapsulate this information. You could use this type to produce handles for its parent and children, and it could provide access to the underlying bbox or primitive. It could be treated just as we do a node right now, almost as a drop-in replacement.

4.11.3 Preparing to implement Implicit Tree construction

Development will be done on the new implicit-tree-structure branch.

4.11.3.1 Planning

Conversion of the current tree to an implicit structure can be broken into steps, just as conversion to an N-way tree was.

  • Add reference to the tree's root as a member of the Node class; use the node's root to determine the index of the node index = this - root.
  • Add a new child accessor which takes the node's current index as an argument std::array<Node, N> &children()std::array<Node, N> &children(std::size_t index). Use this accessor where the original was used, and confirm their results match for shallow nodes.
  • Update tree's underlying vector to be large enough to hold the tree if it isn't stored compactly (round up the size as though the tree were perfect). Pre-allocate vector with empty nodes (nodes with empty bboxes).
  • Update tree constructor to use the new index-based accessor when determining where to allocate child nodes. (The two accessors should now produce identical results for all nodes, since the new one is determining the references stored for the old system).
  • Use boost::variant to allow the node type to hold either a bbox or a reference to a primitive. The old child pointer should now be used for exclusively that purpose.
  • Remove the old child pointer (along with correctness check). Node type should now only contain a bbox and a reference to the root.
  • Construct-on-access of child nodes: whenever we retrieve the child of a node, use its constructor to re-build it in place from its bbox and the root before returning it (this won't do anything from now).
  • Construct-on-access of root node: similar to previous change, but affects the AABB_tree type.
  • Replace vector of nodes with vector of bboxes/primitive references std::vector<AABB_node>std::vector<boost::variant<Bbox, Primitive *>>. Node type should now contain a reference to the root, and a reference to the bbox/primitive *. Update all accessors to follow this reference.
  • Rename Node to Node_handle, because it no longer directly contains the node's data.
  • Substitute union where boost::variant was used; once the system is confirmed to be working the variant's type checking adds a lot of unnecessary overhead.

4.11.3.2 "Fat nodes" intermediate solution

The first few steps in my plan produce a system where nodes are arranged as an implicit tree, and use pointer math to find their children. In order to make this work while keeping the logic inside the node, however, it was necessary to give it extra information. When each node is constructed, it is given a reference to the root node (this includes the root node, which has a reference to itself). With this information, each node has the following knowledge:

  • Its own bounding box
  • The location of the root node, at the start of the vector
  • The location of itself (through the this pointer)
  • The a pointer to a primitive (if it has one), or to the child nodes (this is redundant, but lets me check my work).

As a result, the node type is slightly larger than the original implementation. Despite this, it still performs almost exactly as well. This suggests that this benchmark is not particularly cache-limited, which makes sense given what we already know.

4.11.3.3 Making the nodes more compact

The children pointer is now strictly unnecessary, but that space is shared with the pointer to the primitive. We could stop saving a reference to the children, but that would leave us with a primitive pointer that's left null for non-leaf nodes. My solution was to combine the primitive pointer and the bbox, using a variant type: boost::variant<Bbox, Primitive *>. Unfortunately, this changes the topology of the tree in a way that reduces performance; this will be discussed in a later section.

4.11.3.4 Moving traversal logic to the tree

The node type still contains one redundant value: the root pointer. Nodes need this point of reference to calculate their children's locations, but every node holds the same value!

There are several ways I could go about eliminating this value, but after a discussion with Andreas we decided that the simplest approach would be to move the traversal code to the tree. Since the tree naturally knows where the root node is, we don't need to do anything special to hold onto it once the function is moved. node.children() is replaced by tree.children(node). Of course, since this function is primarily called from within the tree, it often looks like children(node).

Despite some initial confusion, I eventually determined that this change had no meaningful effect on performance, one way or another.

4.11.4 The source of N-Way Tree's performance advantage

Andreas and I had a conversation with Sebastien about the performance advantages we were seeing in the N-way tree. For traversals with rays and segments this was as much as a 10% advantage, so we wanted to determine the source. Evaluating some other tree implementations showed that the performance increase was likely the result of a particular difference in the tree's topology. The original AABB had nodes which could hold either two children, two primitives, or a mix. The N-way tree has nodes that can contain either a pair of children or a single primitive. What this means in practice is that each primitive is wrapped by its own bounding box, during traversal this bbox acts as a geometric filter. We previously believed that this filtering was done by the intersection function already, which would have made the bbox redundant; but that didn't turn out to be the case.

In order to confirm that this was the source of the performance gain, we added a geometric filter directly inside the traversal function of a version of the tree which didn't have this topology.

Effects of geometric filter on any_all benchmark
Tree ImplementationEpick Kernel Time (s)
Original0.97
N-way (1-node per primitive)0.88
Implicit Structure (same topology as original)0.97
Implicit Structure with Filtering0.91

Doing this filter inside the traversal function meant that for the last time, the boxes are calculated on the spot. This may seem wasteful, but it performs nearly as well as the N-way tree, while only taking up as much space as the original!

The tree class actually allows the use to provide a utility for mapping primitives to their bounding boxes. By default this simply invokes primitive.bbox(), but it could be substituted for a solution which caches the boxes. This could provide the best of both worlds: it doesn't require building boxes for primitives more than once, and it doesn't take up as much space as all those extra nodes in the tree.

4.11.5 Building a tree using a sort along the Hilbert curve

One technique used by Embree to great effect is "Spatial Sorting". This is an alternative method of construction that produces a lower quality tree, but does so very quickly. It's useful for Embree when rendering dynamic scenes; when the primitives contained by a tree move, the tree needs to be rebuilt every frame. This may be useful for CGAL as well, wherever trees are built frequently (or only used a few times). Building a high quality tree is expensive, and only worthwhile if that cost is amortized over many traversals.

To build the tree quickly, Embree starts by sorting the primitives along a space filling curve. This can be done very efficiently because it's possible to give each location on the curve a unique code, and these codes can be used in a Radix sort. When primitives are sorted along a space filling curve, nearby objects will be placed closer together in the list. This is true on multiple scales, as well: the first half of the list and the second half of the list should contain groups of primitives with little spatial overlap. This means that after the list has been sorted once, we can produce a tree of acceptable quality by recursively splitting the list up, and putting boxes around its sections.

If we want to replicate this, we need to do something similar. We can use CGAL's existing hilbert_sort method, which arranges primitives along a hilbert curve. Luckily, the AABB tree's solution for dividing primitives between nodes is cleanly separated from the rest of its construction logic. It is provided as a functor withing CGAL's AABB_traits class. The relevant part of that functor looks like this:

template<typename PrimitiveIterator>
void operator()(PrimitiveIterator first,
                PrimitiveIterator beyond,
                const typename AT::Bounding_box &bbox) const {
  PrimitiveIterator middle = first + (beyond - first) / 2;
  switch (Traits::longest_axis(bbox)) {
    case AT::CGAL_AXIS_X: // sort along x
      std::nth_element(first, middle, beyond, boost::bind(Traits::less_x, _1, _2, m_traits));
      break;
    case AT::CGAL_AXIS_Y: // sort along y
      std::nth_element(first, middle, beyond, boost::bind(Traits::less_y, _1, _2, m_traits));
      break;
    case AT::CGAL_AXIS_Z: // sort along z
      std::nth_element(first, middle, beyond, boost::bind(Traits::less_z, _1, _2, m_traits));
      break;
    default:
      CGAL_error();
  }
}

This takes the existing list of primitives and partitions it along whichever axis of the Bbox surrounding them is longest. Because this is called from within a recursive function, it's also applied to sub-lists. One critical difference of the hilbert sort approach is that the list only needs to be sorted once, to sort sub-lists would be redundant. In order to prevent that, we need to include a boolean flag that makes sure the sorting isn't done multiple times. Because this is a functor and not a function, it's easy to add that flag as another member variable. It took a lot of debugging to get the solution to this point, but the end result is shockingly simple:

template<typename PrimitiveIterator>
void operator()(PrimitiveIterator first,
                PrimitiveIterator beyond,
                const typename AT::Bounding_box &bbox) const {
 
  // If this is our first time splitting the primitives, sort them along the hilbert curve
  // This should generally put nearby primitives close together in the list
  if (!has_been_sorted) {
 
    // Create a property map using our Get_reference_point functor
    auto property_map = boost::make_function_property_map<Primitive, Traits::Point_3, Get_reference_point>(
            Get_reference_point(m_traits)
    );
 
    // Our search traits will use that property map
    typedef CGAL::Spatial_sort_traits_adapter_3<Geom_traits, decltype(property_map)> Search_traits_3;
 
    // Perform our hilbert sort using the search traits type with our custom property map
    CGAL::hilbert_sort(first, beyond, Search_traits_3(property_map));
 
    // In the future, it's not necessary to re-sort the primitives (we can blindly partition them in the middle)
    has_been_sorted = true;
  }
}

The most complicated part of this was the Get_reference_point object. The underlying Primitive type might not be a point type. In fact, it might be nothing but a reference to a shared object, or it might have its properties generated on the spot by an outside mechanism. To account for this, we need to have a system that maps each primitive to a representative Point_3 so it can be used in the spatial sort. That's done using the property map type, which requires a unary function:

struct Get_reference_point : public std::unary_function<const Primitive &, typename Traits::Point_3> {
  const Traits &m_traits;
  typedef internal::Primitive_helper<Traits> Helper;
 
  Get_reference_point(const AABB_traits<GeomTraits, AABBPrimitive, BboxMap> &traits)
          : m_traits(traits) {}
 
  typename Traits::Point_3 operator()(const Primitive &p) const {
    return Helper::get_reference_point(p, m_traits);
  }
};

Despite some ugliness, this solution fits very nicely into the AABB tree's existing abstractions.

4.11.5.1 Performance

Based on my understanding of the tradeoffs of a spatial sorting tree construction, I would expect tree construction to become much faster but traversal to become slower. The new construction times look like the following (averaged over 100 constructions of a tree from the larger bunny00.off dataset).

Average Time to Construct a Tree from 37,706 Triangles
KernelRecursive SplittingHilbert Sort
Simple cartesian float0.05812810.0644998
Cartesian float0.09994170.101805
Simple cartesian double0.04252050.0429074
Cartesian double0.08289170.0933127
Epic0.03912750.0417023

These results are a little bit surprising! The hilbert sort approach takes marginally more time to build a tree, and I was able to confirm that this affect is consistent across multiple runs. My suspicion is that because CGAL's hilbert sort doesn't use a radix sort internally, we lose out on a lot of its advantage.

It should be acknowledged that these times are all very small -- dwarfed by the time spent traversing in this benchmark. The results for traversal are even more surprising:

Average Time to Traverse a Tree with 500,000 queries
KernelRecursive SplittingHilbert Sort
Simple cartesian float0.7137110.47457
Cartesian float1.248240.840549
Simple cartesian double0.6301770.446037
Cartesian double1.165260.81038
Epic1.24770.909729

Here, the hilbert sort approach handily outperforms the existing solution! What is going on?

4.11.6 Building a Complete tree with repeated subdivision

In several cases, we've encountered scenarios where it would be beneficial if we could guarantee that our tree was complete. Unfortunately, the current solution for construction the tree doesn't have an obvious mechanism to make this happen. It distributes primitives "shortsightedly", choosing how many go in the left and right subtrees of the current node without any knowledge of the rest of the tree's structure. It's trivial to build a complete tree heuristically, but to do it with only information about the current node is more challenging.

The heuristic it currently uses is to spread the leaves as evenly as possible (for example: 15 primitives --> 8 left, 7 right). Unfortunately, that will often produce non-complete trees. For example, a tree of 6 nodes will look like this:

      N
     / \
    N   N
   /|   |\
  N P   N P
 / \   / \
P   P P   P

But the complete version which contains the same number of leaf nodes would look like this:

      N
     / \
    N   N
   / \  |\
  N   N P P
 /|   |\
P P   P P

Notice that in order to produce a complete tree, we need to make the seemingly-unbalanced choice to distribute 4 of our 6 primitives to the left, rather than 3. What we need is a distribution function that can serve as a drop-in replacement for this function, but produce complete trees.

After thinking through the problem with the help of a whiteboard, we assembled the following formula:

Given a number of primitives to distribute N, our goal is to find the number of primitives that should be distributed to the current node's subtrees in order to produce a complete tree with N leaves (one for each primitive). The solution should give us N_{{L}} and N_{{R}}, the number of primitives that should be distributed to the left and right subtrees, respectively.

s=2^{{d}},{\text{where }}s\leq N

Where s is equivalent to the number of nodes at depth d, and d is the depth of the tree's deepest full layer.

r=N-s

Here, r is the number of remaining nodes if you were to build a tree with s leaves.

N_{{L}}={\frac  {s}{2}}+min({\frac  {s}{2}},r)

The first term of this equation is the base number of leaves that would be in the left subtree of our tree of size s. The second term finds how many need to be added, with min() providing the guarantee that we won't more than double the number of leaves, as that would require going more than one level deeper.

N_{{R}}=N-N_{{L}}

For a 2-way tree, finding N_{{R}} is simple, because the right subtree should contain all the primitives that the left subtree couldn't fit.

Put together, the complete formula looks like the following:

{\begin{aligned}&{\text{Given some }}N\\s&=2^{{d}},\ {\text{where }}s\leq N\\r&=N-s\\N_{{L}}&={\frac  {s}{2}}+min({\frac  {s}{2}},r)\\N_{{R}}&=N-N_{{L}}\\\end{aligned}}

4.11.6.1 Generalizing to higher-order trees

This math works well for a standard 2-way tree, but completeness is just as important for higher order trees, where each node can have more than 2 children. Luckily, the math was relatively simple to generalize.

We're now finding N_{{i}}, the number of primitives in node i, for any i in the set of K children.

{\begin{aligned}&{\text{Given some }}N,K,i,{\text{where }}0\leq i<K\\s&=K^{{d}},\ {\text{where }}s\leq N\\r&=N-s\\r_{{i}}&=max(0,r-(i*{\frac  {s}{k}}))\\N_{{i}}&={\frac  {s}{K}}+min({\frac  {s}{K}},r_{{i}})\\\end{aligned}}

Here, r_{{i}} is the number of remaining primitives left over after extra primitives have already been distributed to nodes 0...i-1. Besides that, this formula functions in much the same way as the 2-way function.

4.11.7 Better benchmarks for Hilbert-sort construction

Our earlier benchmarks for the Hilbert sort tree construction approach produced some unintuitive results. Conventional wisdom is that building a tree this way should be faster than recursive splitting, but that the tree produced will be of lower quality. In benchmarks, this would show as a large improvement in construction speed, coupled with a small penalty to traversal speed. Our tests showed the exact opposite: construction speed was slightly slower, and traversal was faster!

Development of the hilbert sort traits class is now part of a new pull request. As part of this pull request, it's worthwhile to clarify how and where the new technique outperforms the recursive splitting approach. To this end, I created a suite of benchmarks which show the relevant comparisons. The benchmarking code is heavily templated, so it's simple to measure a large cross-section of potential parameters.

(trees are constructed from handle.off, and queries are randomly generated line segments).

CGAL::Simple_cartesian<float>Recursive PartitionHilbert Sort
construction0.000724111 s0.000778169 s
traversal4.4037e-07 s3.9547e-07 s
CGAL::Cartesian<float>Recursive PartitionHilbert Sort
construction0.00129678 s0.00168347 s
traversal5.5959e-07 s4.988e-07 s
CGAL::Simple_cartesian<double>Recursive PartitionHilbert Sort
construction0.000457487 s0.000572949 s
traversal4.2589e-07 s3.7067e-07 s
CGAL::Cartesian<double>Recursive PartitionHilbert Sort
construction0.00112452 s0.0015394 s
traversal5.3615e-07 s4.7219e-07 s
CGAL::EpickRecursive PartitionHilbert Sort
construction0.000462438 s0.000580323 s
traversal5.5827e-07 s4.845e-07 s

This is a lot of data, so in the future, we'll focus primarily on the last test (the one that uses the Epic kernel). Epick is the heaviest kernel, and its results tend to be consistent with the others.


CGAL::EpickRecursive PartitionHilbert Sort
construction0.000462438 s0.000580323 s
traversal5.5827e-07 s4.845e-07 s

Here, construction is a little bit over 20% slower. This is significant, and raises questions considering we actually expected a significant increase in speed. Traversal, on the other hand, is over 15% faster. This is very interesting since we should expect the hilbert-constructed tree to be slower. Both trees produce the same results, which suggests that something might be wrong with the partitioned tree which is preventing it from getting the performance it could.

In any case, if the hilbert sort tree can consistently perform this well, that makes it useful in its own right. There's always the possibility that this performance is the result of a quirk of our input data, so in order to show that that's not the case, we need to test in a variety of different scenarios. For example, using Rays instead of Line segments should result in more intersections, how does that affect performance?

CGAL::EpickRecursive PartitionHilbert Sort
construction0.000458197 s0.00057073 s
traversal5.8002e-07 s5.0278e-07 s

Naturally, construction is almost identical because the tree's data hasn't been changed. The increase in intersections has hurt traversal performance, but this is a flat effect -- this data doesn't change the conclusions we can make.

How about if we were to build a tree from a larger dataset? Before we loaded data from handle.off, containing 1,165 vertices, now we'll use bunny00.off, which contains a massive 37,706 vertices -- over 30x larger! Unlike the previous test, we should expect this to produce the different results for construction times.

CGAL::EpickRecursive PartitionHilbert Sort
construction0.0357437 s0.0420231 s
traversal8.12154e-07 s9.40858e-07 s

Here, we see no change to the relationship between construction times, but our traversal performance has reversed! Traversing the Hilbert sort tree now takes 15% _slower_ on average, rather than faster. My first guess was now that the test is longer, hilbert sort had a disadvantage from running second (once the CPU is already hot). Swapping the order of the tests hardly affected results, showing that this wasn't the case. Perhaps for deeper trees, the optimality of their structure becomes more important to performance.

4.11.8 Improving Hilbert sort construction performance

4.11.8.1 Bottom-up boxing

Hilbert sort isn't constructing as fast as it could, because the bbox is still being produced on the way down. The hilbert sort process involves recursive splitting, so its only advantage comes from the fact that the sorting process doesn't require knowledge of bounding boxes. Because of this, the boxes can be produced from the bottom up more cheaply (because each box only depends on child boxes, and not on the list of primitives it contains).

We can create our boxes from the bottom up with the addition of a template specialization that treats the hilbert traits class differently. Once this is added, we see a big improvement in construction speed, and the benchmark starts to look a lot more how we expected.

CGAL::EpickRecursive PartitionHilbert Sort
construction0.0297283 s0.0230078 s
traversal6.85527e-07 s8.32478e-07 s

Construction is now about 30% better, and traversal is around 20% worse.

4.11.8.2 Multithreading

We can get a further speedup by enabling a multithreaded sort of the primitives, a feature provided by CGAL's hilbert sort implementation.

CGAL::EpickRecursive PartitionHilbert Sort
construction0.0289884 s0.00990184 s
traversal6.75906e-07 s7.76014e-07 s

This gives the hilbert sort an even larger advantage, though it's not entirely fair -- the recursive partition method doesn't incorporate even limited multithreading. Hilbert traits can be used to construct a tree nearly 3 times as fast as the typical method, which can make its tradeoffs very appealing.

4.11.9 More challenging benchmarks

The tweaks we made to the new traits class have given it demonstrable advantages over the traditional method, but in order to reason about when it's appropriate we need to see how it performs in extreme use cases. To that end, we created some more difficult benchmarks.

4.11.9.1 Larger dataset

Andreas provided a .PLY file with a high resolution scan of a gargoyle. Its mesh is made up of over 7,000,000 vertices, so building a tree for it is an expensive operation.

CGAL::EpickRecursive PartitionHilbert Sort
construction0.705003 s0.441753 s
traversal6.02128e-06 s7.83255e-06 s

The new parallel traits class maintains a lot of its advantage, remaining around 40% the traditional method. The size of the tree also had little effect on the relative cost of traversals, with the lower quality tree still requiring under 20% more time to traverse.

4.11.9.2 Real-life use case

The fact that the new construction can be used simply by swapping the traits type opens up a lot of opportunities for interesting real-life tests. This makes it easy to look at how the tradeoffs we've been measuring affect actual use-cases. Surface mesh segmentation is a perfect worst-case benchmark for this optimization, for a few reasons:

  • Because the underlying primitives aren't changed, the tree only needs to be constructed a single time.
  • That one tree is used for a large number of queries (one for each vertex in the output mesh).
  • Each query is guaranteed to fully descend the tree. This is because the queries are rays shot from the inside of a watertight mesh, so they're guaranteed to hit a face of that mesh. As a result of this, ever traversal is very expensive.

We benchmarked segmentation by timing segmentation_from_sdf_values_SM_example. The package subclasses AABB_traits, so we could swap strategies easily by replacing what it subclassed.

SegmentationRecursive PartitionHilbert Sort
time11.0 s12.7 s

Here, we see that for this use case, the new technique isn't worth its tradeoffs.

4.12 August 11 - August 17

4.12.1 Quantifying tradeoffs

So when is it actually worth it to use the new traits class? We can actually determine this from the performance data we've already collected, we'll look at the largest test for this:

CGAL::EpickRecursive PartitionHilbert Sort
construction0.705003 s0.441753 s
traversal6.02128e-06 s7.83255e-06 s

The new traits class reduced the time to construct a tree by 0.263 s and increased the time to perform one traversal by 0.00000181 s. Taken together, this tells us that the new traits class loses its advantage only once you've performed 145,201 traversals. If you do more than that, then it becomes worthwhile to spend more time to build a higher quality tree, so your traversals can be faster. I expect this number varies quite a bit depending on the exact nature of the data, but the order of magnitude difference should be indicative.

Of course, this result is specific to the size of the tree we built! Construction time and traversal time both vary non-linearly with the number of primitives in the tree, in order to get a better idea of how tree size affects our tradeoffs, we need to perform benchmarks at different scales. I created a new benchmarking approach which progressively shrinks the dataset, and measured the effect that had on construction and traversal times.

CGAL::Epick
# PrimitivesRecursive Partition ConstructionHilbert Sort ConstructionRecursive Partition TraversalHilbert Sort Traversal
16375300.690989 s0.437864 s5.82817e-06 s7.5607e-06 s
10000000.481111 s0.258386 s6.25142e-06 s7.0905e-06 s
1000000.0604657 s0.0258631 s6.44138e-06 s6.54842e-06 s
100000.00545731 s0.0029774 s5.30458e-06 s5.58388e-06 s
10000.00149441 s0.00137508 s3.64947e-06 s3.67336e-06 s

From here it's easy to calculate breakeven points for each order of magnitude:

CGAL::Epick
# PrimitivesTraversals-per-construction
1637530146,000
1000000265,000
100000323,000
100008,880
10004,990

The relationship isn't actually monotonically increasing or decreasing. It looks as though fast-construction is most effective at somewhere around 100,000 primitives. As the tree gets larger, the utility of the new construction strategy declines gradually, and when the tree is smaller it loses its advantage rapidly. For a small 1000-primitive tree, you only need to do over 5,000 traversals before fast-construction is no longer worthwhile!

4.12.2 More diverse test data

To evaluate how the nature of the data affects the new construction strategy's advantages, I standardized the testing procedure and created synthetic datasets with Andreas' help.

4.12.2.1 Gargoyle

I converted the gargoyle dataset to .off, and rather than simplifying the mesh I randomly removed a subset of its faces. This change to how the number of primitives is reduced is actually significant, because it makes the mesh more sparse in addition to reducing the size of the tree. This will generally result in decreasing hit rates as the dataset is reduced, but because the change is being applied uniformly across our tests this should be acceptable.

CGAL::Epick, gargoyle_1637530.off
# PrimitivesRecursive Partition ConstructionHilbert Sort ConstructionRecursive Partition TraversalHilbert Sort Traversal
16375300.68504 s0.434351 s6.0408e-06 s7.48615e-06 s
10000000.411266 s0.207592 s5.13054e-06 s5.81011e-06 s
1000000.0292015 s0.0108758 s1.10307e-06 s1.4085e-06 s
100000.0024164 s0.00109091 s4.99561e-07 s6.05938e-07 s
10000.00018549 s0.000163817 s3.6545e-07 s2.94521e-07 s

These results mostly look similar to the previous measurements, but things begin to diverge as the number of primitives gets very low. Because the mesh is much more sparse, fewer of the primitives descend the tree as far. This makes the effect of a higher quality tree less visible, and the Hilbert-constructed tree even beats the conventional one when there are only 1,000 elements. Luckily, we still get good data for larger trees, and the results mirror the previous test as expected.

CGAL::Epick, gargoyle_1637530.off
# PrimitivesTraversals-per-construction
1637530173,445
1000000299,710
10000060,000
1000012,460
1000(inf)

When we use the same formula to find our breakeven points, we get mostly reasonable results. The numbers tend to fall within the same orders of magnitude as in the previous test.

4.12.2.2 Rotated Sphere-grid

The dataset for the next test was synthesized using the Polyhedron demo. It consists of an 8x8x8 cubic grid of sphere meshes with an affine transform applied so that they aren't perfectly on axis. The gargoyle mesh formed a "shell" which the rays were cast outwards from, this mesh is very different from that, and that may have consequences for our performance.

  • The rays are shot from outside the spheres, which means that they aren't guaranteed to hit any faces.
  • The spheres are densely distributed around the source, rays are more likely to hit many faces, rather than just one.
CGAL::Epick, sphere_grid_842400.off
# PrimitivesRecursive Partition ConstructionHilbert Sort ConstructionRecursive Partition TraversalHilbert Sort Traversal
8424000.425959 s0.186575 s1.04627e-05 s1.15029e-05 s
1000000.0369721 s0.0111089 s3.20296e-06 s3.64699e-06 s
100000.00260251 s0.00106912 s3.8749e-07 s4.28021e-07 s
10000.000217795 s0.000175691 s5.56993e-08 s5.78713e-08 s

At a glance, these results seem to show a larger advantage for the Hilbert sort, after all, the construction is more than twice as fast and traversal is only slightly slower. This may not have as big an affect on the breakeven point as it seems at first: for tests on this dataset, construction was especially fast and traversal was especially slow for either traits class. Practically speaking, this means that the value of a relatively faster construction is lower, because times are more likely to be driven by the number of traversals.

CGAL::Epick, sphere_grid_842400.off
# PrimitivesTraversals-per-construction
842400230,133
10000058,247
1000037,833
100019,385

It seems as though the two opposing effects mostly cancel out. The breakeven point tends to be within the same orders of magnitude as in previous tests.

4.12.2.3 Mixed Sphere-grid

Next, we produced another dataset by merging the affine-transformed sphere grid with the axis-aligned pre-transformation version. Besides increasing primitive count, this also has the effect of increasing overall density and reducing the likelihood that a ray will hit nothing.

CGAL::Epick, sphere_grid_combined_1684800.off
# PrimitivesRecursive Partition ConstructionHilbert Sort ConstructionRecursive Partition TraversalHilbert Sort Traversal
16848000.924095 s0.488123 s2.49734e-05 s2.78055e-05 s
10000000.508275 s0.217452 s2.00373e-05 s2.10856e-05 s
1000000.0351812 s0.0113362 s8.32743e-06 s9.66192e-06 s
100000.00257242 s0.00102682 s4.33653e-06 s4.68058e-06 s
10000.000201988 s0.000168705 s1.48951e-06 s1.46143e-06 s

This test exhibits the same issue as the first, where for small dataset sizes the traversal data is potentially unreliable, and the Hilbert constructed tree beats the conventional method.

CGAL::Epick, sphere_grid_combined_1684800.off
# PrimitivesTraversals-per-construction
1684800153939
1000000277423
10000017868
100004492
1000(inf)

Once again, we see results with the same magnitude as the other tests.

4.12.3 Switching from Median to Middle Hilbert Policy

The Hilbert sorting package provides two different strategies, the one we've been using so far splits based on the median of the points, but we might be able to get better results by switching to the "middle" policy.

4.12.3.1 Gargoyle

CGAL::Epick, gargoyle_1637530.off
# PrimitivesRecursive Partition ConstructionHilbert Sort ConstructionRecursive Partition TraversalHilbert Sort Traversal
16375300.694047 s0.438711 s5.88445e-06 s7.44739e-06 s
10000000.406627 s0.204171 s5.05238e-06 s5.77835e-06 s
1000000.0294115 s0.01054 s1.0871e-06 s1.45127e-06 s
100000.00239611 s0.00103612 s5.37958e-07 s6.09279e-07 s
10000.0001899 s0.0001647 s3.39e-07 s2.95038e-07 s
CGAL::Epick, gargoyle_1637530.off
# PrimitivesTraversals-per-construction
1637530163,369
1000000278,877
10000051,821
1000019,069
1000(inf)

For this data, performance is consistently slightly worse, meaning that the breakeven point is lower.

4.12.3.2 Rotated Sphere-grid

CGAL::Epick, sphere_grid_842400.off
# PrimitivesRecursive Partition ConstructionHilbert Sort ConstructionRecursive Partition TraversalHilbert Sort Traversal
8424000.419961 s0.187204 s1.03394e-05 s1.13945e-05 s
1000000.0360399 s0.0110656 s3.34002e-06 s3.62142e-06 s
100000.00264938 s0.00108409 s3.97501e-07 s4.16641e-07 s
10000.000221205 s0.000191903 s5.85699e-08 s6.01912e-08 s
CGAL::Epick, sphere_grid_842400.off
# PrimitivesTraversals-per-construction
842400220,602
10000088,750
1000081,781
100018,073

Here, things are a little bit more interesting. The new strategy is slightly better when the number of primitives is small, but performs worse for the largest set.

4.12.3.3 Mixed Sphere-grid

CGAL::Epick, sphere_grid_combined_1684800.off
# PrimitivesRecursive Partition ConstructionHilbert Sort ConstructionRecursive Partition TraversalHilbert Sort Traversal
16848000.920223 s0.486563 s2.50497e-05 s2.79779e-05 s
10000000.505539 s0.215818 s2.01638e-05 s2.11468e-05 s
1000000.035145 s0.0111322 s8.34144e-06 s9.48135e-06 s
100000.00264618 s0.0010278 s4.33139e-06 s4.66359e-06 s
10000.000208092 s0.000167298 s1.52239e-06 s1.53896e-06 s
CGAL::Epick, sphere_grid_combined_1684800.off
# PrimitivesTraversals-per-construction
1684800148,098
1000000294,731
10000021,066
100004,872
10002,462

Here, the difference is less consistent, with the new strategy winning some and losing others. These results aren't reliable enough in my eyes to justify switching to the middle-strategy as a default.

4.12.4 More exploration of Implicit Tree Structures

The implicit tree structure I previously created was based on my earlier n-way-tree work, but in order to have an atomic PR, I need to build the same functionality on top of master. An important luxury I have when working on this is that the AABB tree's node class is fully internal, and it isn't mentioned in the documentation. This means that I have few restrictions on changing its API, and I can make some relatively drastic changes to how it works without breaking compatibility with users' code.

Unlike last time, I decided to base this implementation on the concept of a "Node handle". The node type won't contain any direct information about the state of the node it represents, instead, it acts like an elaborate iterator type. To do this, it needs to contain three things:

  • std::vector<Bounding_box> &m_boxes a reference to a collection of boxes
  • std::vector<Primitive> &m_primitives a reference to a collection of primitives
  • std::size_t m_index the index of the relevant node

The node's public API provides a superset of the functionality of the original node. Here's a summarized version of its public methods:

  class AABB_node {
  public:
 
    AABB_node(std::vector<Bounding_box> &boxes, std::vector<Primitive> &primitives, std::size_t index = 0);
 
    void expand(ConstPrimitiveIterator first,
                ConstPrimitiveIterator beyond,
                const ComputeBbox &compute_bbox,
                const SplitPrimitives &split_primitives,
                const AABBTraits &);
 
    void traversal(const Query &query, Traversal_traits &traits) const;
 
  public:
 
    // Existing functionality
    inline const Bounding_box &bbox();
    inline Bounding_box &bbox();
    inline const Primitive &primitive() const;
    inline Primitive &primitive();
    inline Node left_child() const;
    inline Node right_child() const;
 
    // New functionality
    inline bool is_leaf() const;
  }

A lot of the functionality of the original node has been renamed, and the way it's implemented has changed completely:

  • bbox() provides access to the node's associated bbox, using the node's index and its reference to the bbox collection.
  • primitive() is similar, though some math is necesary to determine the index of the primitive in m_primitives.
  • left_child() and right_child() provide access to children by constructing them on-the-spot, since the node handles don't actually exist in an array
  • expand() uses a similar recursive algorithm to the original, and it was actually simplified by the fact that the connections between nodes already exist implicitly
  • traversal() was also simplified, because it's now easier to determine if a node is a leaf node

The additional context that the node handle has also makes it easy to add new functionality, which the original tree couldn't provide. For example is_leaf() as mentioned earlier, was as simple as comparing the node's index to the number of non-leaf nodes. There's also an opportunity to add additional functionality in the future, like the following:

  • parent() can find the parent using a simple equation of the index
  • children() can be found as a list, because they are always adjacent
  • siblings() can be done similarly

Changes to the tree itself are much smaller, and don't affect its API. Importantly, the tree's node array is replaced with a bbox vector. The tree also holds an instance of the root node (index 0). This is only necessary so that the root accessor can still return a pointer, otherwise it would have made more sense to construct this node on request, like all other nodes.

4.12.4.1 Use of Assertions

One feature that was central to my development process was the way the node handle class uses assertion statements. I applied assertions liberally, the implicit nature of the tree would have made it hard to catch certain types of errors otherwise. Often, I used assertions to validate assumptions rather than as preconditions. For example, here's one way assertions are used during the construction of a tree:

    ...
 
    // Stop when we've reached a leaf node
    if (is_leaf()) {
 
      // Make sure we're only trying to assign a single primitive to this node
      CGAL_assertion_msg(
              beyond - first == 1,
              "Constructor attempted to assign more than one primitive to a leaf node"
      );
 
      // Make sure the primitives table is aligned with the bbox table
      CGAL_assertion_msg(
              first - m_primitives.get().begin() == primitive_index(),
              "Primitive misalignment! "
              "The tree's topology should guarantee that primitives are aligned with their boxes."
      );
 
    } else {
 
      ...

Because of the way our implicit construction works, the recursive method has an empty base case. When we reach a leaf node, it should already have all of its connections, and it should be associated with its primitive based on its index. Instead of eliminating this space entirely, I used it to check that the node was imbued with the right properties. I check that the leaf node is being associated with exactly one primitive, as well as that the leaf node can correctly locate the primitive it's associated with.

4.12.5 "Fully boxed" vs "Leafless" trees

A topic that's already come up several times is whether each primitive should have its own bounding box. Here, I'll refer to the options as "Leafless" and "Fully boxed".

  • A Leafless tree does not put boxes around each primitive; instead, the deepest boxes in the tree surround pairs of primitives. I'm referring to this as "leafless" because it doesn't involve any nodes that don't hold more than one child. This is how CGAL's existing AABB tree works, and it comes with a couple of implications:
    • The tree needs fewer nodes to hold the same number of primitives.
    • A node can hold either two primitives, two child nodes, or a primitive and a node.
    • Traversal and construction are more complicated, to account for the different states a node can have.
  • A Fully-boxed tree gives every primitive its own node. This is how the current version of the N-way tree works as well as the implicit tree, it has its own implications.
    • Nearly twice as many nodes are necessary to hold the same number of primitives.
    • A node can hold either a primitive or a pair of leaves, with no in-between.
    • Child nodes can be held in contiguous arrays, since they aren't mixed in with primitives.
    • Recursive traversal and construction are trivial and intuitive, because nodes can only be in two different states, with leaf nodes being the base-case.
    • Performance tends to be better by default, because the last bbox acts as a geometric filter for primitives that don't already have a bbox pre-check for intersections.

The AABB tree supports a Bbox-map which saves the result of finding the bbox for a primitive. This might be able to provide the same performance of the fully-boxed tree without needing the same amount of additional space. In order to look at the tradeoffs holistically, it's necessary to create a leafless version of our implicit tree.

4.12.5.1 Leafless Implicit Tree

The implicit tree was relatively easy to convert to a leafless structure, using the following changes:

  • The box vector was reduced in size, to only include the ones necessary for this configuration.
  • The constructor was changed to only set the bbox of non-leaf nodes (attempts to set it would otherwise cause out-of-bounds errors.
  • The math used to find the primitive for a node was changed to account for the new number of boxes.
  • Traversal needed additional logic to prevent attempting to intersect with the boxes of leaf-nodes.

Conceptually, the change to our "table" converts it from this:

   B B B B B B B B B B B B B
               P P P P P P P

to this:

   B B B B B B
               P P P P P P P

The way we use indices in our node handles remains the same, the only difference being that certain indices don't have boxes associated with them.

4.12.6 Benchmarking tree configurations

(Testing here is done with the Epic Kernel and gargoyle_1637530.off dataset)

First, we want to see if taking the same tree topology and making it implicit hurts performance, to do this we can look at how the leafless versions of both trees compare.

PerformanceMaster (leafless)Implicit (leafless)
Construction Time0.941639 s0.97235 s
Traversal Time5.76427e-06 s6.60947e-06 s
Memory Usage404.0MiB352.8 MiB

Here, the conversion to implicit comes with a small but measurable performance penalty for both construction and traversal. The slowdown to traversal is not surprising, as I haven't made any optimizations to retrieving properties of nodes (many redundant array accesses happen during traversal). The construction slowdown is, on the other hand, surprising. A lot of logic was removed from the construction code, so I'm uncertain where this penalty comes from.

Memory usage goes down, but not by very much. The primitives themselves are likely responsible for a much larger portion of the memory used than the tree's structure, since bounding boxes and pointers are relatively small compared to high-precision coordinate types. Moreover, even for the massive dataset used here, memory is far from the limiting factor.

We also want to see whether using a bbox-map can help us get better performance out of a leafless tree. For this we can compare the fully-boxed implicit tree to both leafless implementations with the addition of the bbox map. The N-way tree is also included in this benchmark as an example of a fully-boxed architecture that uses explicit connections between nodes.

PerformanceMaster (leafless + bbox-map)Implicit (leafless + bbox-map)Implicit (fully-boxed)N-way (fully-boxed)
Construction Time0.758121 s0.840891 s0.960223 s0.672876 s
Traversal Time5.78328e-06 s6.59443e-06 s5.64262e-06 s5.40254e-06 s
Memory Usage452.7 MiB427.7 MiB446.0 MiB479.0 MiB

5 Proposed Optimizations

Potential Approaches for Improving AABB-tree performance (using SIMD)
ApproachPredicted YieldIntrusivenessDescription
N-Way SplittingHighHighWide BVH like Embree uses, each node holds N >= 2 children, where N is the # of SIMD lanes.
Parents hold child boundsMediumLowRather than each node containing its own boundary, nodes can hold the bounding boxes of each of their children. This improves the arrangement in memory, potentially allowing easier vectorization.
Child-skipping ?MediumIf a node contains a complete tree to depth N, gather all (great)grandchildren at depth N and perform intersections with them (SIMD allows us to perform intersections on multiple nodes without much additional cost).
SIMD Intersection FuncLowLowFine grained vectorization of the intersection algorithm, no changes outside the function itself. This is likely already being done by the compiler.
Data AlignmentLowMediumMake sure that the AABB-tree's data is stored in memory-aligned data structures, for faster loading into SIMD registers.
Stream TraversalMediumHighQueries composed of multiple similar rays, improving cache performance and making use of SIMD for simultaneous intersections.
Query Boxing ?MediumPlace the query in a BBox or set of BBoxes, to avoid the need to use high-precision intersection approaches until intersecting with primitives.
Subtree Collapsing*MediumMediumLeaf nodes have a "bucket size", can hold more than 2 primitives.
Pointer-less nodes*HighHighUse index or pointer math to access children, so that each node doesn't need to store a reference
Spatial Sort Construction*HighLowBuild the tree quickly by sorting the primitives along the Hilbert curve.

* Certain optimizations may not result in increased use of SIMD instructions.

6 Queue

Upcoming work:

  • Add description column to tree comparison table
  • Sign INRIA copyright transfer
  • Examine Ray-BBox intersection implementation
    • Determine how SIMD may be applied.
    • Extract real examples of ray-bbox intersections from tetrahedral remeshing examples, for use in benchmarking.
    • Create benchmark from real examples
    • Add a ray-bbox intersection implementation using xsimd
      • Cache broadcasted ray type, in case repeated broadcast scalar operations are hurting performance.
  • Move experimental work to a branch on cgal-public-dev
  • Consider appropriate SIMD libraries
  • Create a batched-bbox memory arrangement to compare with array-of-structs and struct-of-arrays arrangements.
  • Assemble slides to explain the current progress and central challenge of the project.
  • Convert CGAL's existing intersection function to branchless, to see how that interacts with the child-skipping optimization
  • Determine the reason CGAL's (unaltered) Mesh code examples fail on my device
  • Profile provided benchmarks
    • Goal is to discover whether most time is spent traversing the tree (bbox intersections), or directly testing for intersections with primitives
  • Examine BBox-BBox intersection implementation
    • Create a simd-friendly version
  • Benchmark BBox-BBox vs Ray-BBox vs Ray-Primitive [triangle] intersection.
  • Benchmark Ray-BBox vs Boxed-ray-BBox intersection
  • Create an N-way branch on cgal-public-dev
  • Document the advantages of 1-node-per-primitive
  • Determine a breakeven point for hilbert sort construction
    • Produce plots which demonstrate the breakeven point
  • Add concurrency parameter to new traits class
  • Move expand's template specialization so it's not inline
  • Add doxygen documentation for new traits class
    • API
    • Examples
    • Information about tradeoffs
  • Benchmark different types of implicit trees