Sorting Module

The sorting module hosts the GPU radix sorter that orders projected splats by depth before they are drawn. It owns the buffers, bind groups, and compute pipelines needed to transform the unsorted output from preprocessing into a contiguous list of payload indices that the renderer can feed to an indirect draw call. Everything lives under src/sort/ and is shared between the preprocessor and the renderer.

Overview

| Item | Path | Purpose |
| --- | --- | --- |
| Interfaces | src/sort/index.ts | ISorter, SortedSplats, and PointCloudSortStuff contracts used across the renderer stack. |
| Implementation | src/sort/radix_sort.ts | GPURSSorter class plus buffer allocation helpers. |
| Shader | src/shaders/radix_sort.wgsl | Compute entry points for the four stages (zero_histograms, calculate_histogram, prefix_histogram, and the scatter_even / scatter_odd pair). |

Responsibilities

  • Allocate and recycle the ping-pong key and payload buffers used during radix passes.
  • Expose bind group layouts so preprocessing (group 2) and rendering (group 1) can bind sorter resources without duplicating layouts.
  • Record either fixed-size sorts (recordSort) or indirect sorts driven by the counters that preprocessing filled.
  • Maintain the GeneralInfo uniform buffer (keys_size, padded_size, pass flags) and the indirect dispatch buffer shared between preprocessing, sorting, and rendering.
  • Provide helper APIs (createSortStuff, recordResetIndirectBuffer) so the renderer can keep per-model caches in sync with the number of visible splats.
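The responsibilities above imply a fairly small sorter surface. The sketch below is a hedged approximation of that contract; the real signatures live in src/sort/index.ts, and any field or parameter name not mentioned on this page is an assumption for illustration.

```typescript
// Hypothetical sketch of the sorter contract; real signatures live in
// src/sort/index.ts. Placeholder `object` types stand in for GPU resources.
interface PointCloudSortStuffSketch {
  numPoints: number;   // visible splat count this bundle was sized for (name assumed)
  sorterUni: object;   // stands in for the GeneralInfo GPUBuffer
  sorterDis: object;   // stands in for the indirect dispatch GPUBuffer
}

interface ISorterSketch {
  // Allocate the per-point-cloud buffers and bind groups.
  createSortStuff(device: object, numPoints: number): PointCloudSortStuffSketch;
  // Reset keys_size and the dispatch counters before preprocessing runs.
  recordResetIndirectBuffer(dis: object, uni: object, queue: object): void;
  // Record the zero/histogram/prefix/scatter passes with indirect dispatch.
  recordSortIndirect(stuff: PointCloudSortStuffSketch, dis: object, encoder: object): void;
}

// A no-op stand-in showing the shape a renderer would hold onto.
const sorterStub: ISorterSketch = {
  createSortStuff: (_device, numPoints) => ({ numPoints, sorterUni: {}, sorterDis: {} }),
  recordResetIndirectBuffer: () => {},
  recordSortIndirect: () => {},
};

const stuff = sorterStub.createSortStuff({}, 1_000_000);
console.log(stuff.numPoints); // 1000000
```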

Data Flow

Preprocessor (depth keys)
        |
        | writes key_a, payload_a, keys_size, dispatch_x
        v
PointCloudSortStuff (key_a, payload_a, sorter_uni, sorter_dis)
        |
        | consumed by GPURSSorter passes
        v
Zero -> Histogram -> Prefix -> Scatter
        |
        v
Renderer (binds sorter_render_bg, issues indirect draw)

Buffer and Uniform Snapshot

| Buffer | Source | Usage | Notes |
| --- | --- | --- | --- |
| key_a / key_b | Sorter | STORAGE | Ping-pong depth keys (u32). |
| payload_a / payload_b | Sorter | STORAGE | Ping-pong payload indices; payload_a doubles as the renderer's sorted index buffer. |
| internal_mem | Sorter | STORAGE | Histograms, partitions, and lookback metadata for the WGSL passes. |
| sorter_uni | Sorter | STORAGE | GeneralInfo struct (keys_size, padded_size, pass selectors). Updated by the renderer and preprocessing. |
| sorter_dis | Sorter | STORAGE / INDIRECT / COPY_DST | Stores dispatch_x/y/z for the indirect histogram and scatter passes plus draw calls. |
| sorter_bg_pre | Sorter | Bind group | Exposes sorter_uni, key_a, payload_a, and sorter_dis to preprocessing (@group(2)). |
| sorter_render_bg | Sorter | Bind group | Exposes sorter_uni and payload_a to the renderer (@group(1)). |

PointCloudSortStuff bundles the resources above so the renderer can keep a cache per point cloud or per global buffer.
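The buffer sizes follow directly from the constants on this page: u32 keys and payloads, padded to a whole number of histogram blocks (256 threads × 15 rows = 3840 keys). A hedged sketch of that math, with a helper name of our own choosing:

```typescript
// Sketch of how the ping-pong buffers might be sized, assuming the constants
// described on this page: u32 entries and 3840 keys per histogram workgroup.
// The helper name is ours, not part of the real API.
const HISTOGRAM_WG_SIZE = 256;
const RS_HISTOGRAM_BLOCK_ROWS = 15;
const KEYS_PER_WORKGROUP = HISTOGRAM_WG_SIZE * RS_HISTOGRAM_BLOCK_ROWS; // 3840

function estimateSortBufferBytes(numPoints: number) {
  // Pad to the next multiple of 3840 so the histogram pass can iterate
  // without bounds checks.
  const paddedSize = Math.ceil(numPoints / KEYS_PER_WORKGROUP) * KEYS_PER_WORKGROUP;
  const bytesPerEntry = 4; // u32 depth key or payload index
  return {
    paddedSize,
    keyBufferBytes: paddedSize * bytesPerEntry,     // key_a (and key_b)
    payloadBufferBytes: paddedSize * bytesPerEntry, // payload_a (and payload_b)
  };
}

console.log(estimateSortBufferBytes(1000));
// → { paddedSize: 3840, keyBufferBytes: 15360, payloadBufferBytes: 15360 }
```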

GPU Pipeline (4 passes)

  1. Zero pass: clears histograms and shared memory.
  2. Histogram pass: each workgroup (256 threads × 15 rows) scans 3840 keys, extracts the eight-bit digit for each of the four radix passes, and atomically accumulates counts.
  3. Prefix pass: a 128-thread workgroup performs an exclusive scan over the 256-entry histograms to produce global offsets.
  4. Scatter passes: two entry points (scatter_even, scatter_odd) handle passes {0, 2} and {1, 3}, using ping-pong buffers to avoid hazards. Each thread computes a local rank, adds the global offset, and writes the key and payload to their new positions.
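The four passes can be followed on the CPU. The reference below is a serial, illustrative sketch of the same four-pass, eight-bit LSD radix sort over u32 key/payload pairs; it mirrors the histogram → prefix → scatter structure and the ping-pong buffering, but none of the GPU parallelism.

```typescript
// CPU reference for the GPU passes: four LSD radix passes over 32-bit keys,
// eight bits per pass, ping-ponging between (keys, payloads) pairs.
// Serial and illustrative only; the GPU performs these steps as histogram,
// prefix-scan, and scatter dispatches.
function radixSortPairs(keys: Uint32Array, payloads: Uint32Array) {
  let src = { keys: keys.slice(), pay: payloads.slice() };
  let dst = { keys: new Uint32Array(keys.length), pay: new Uint32Array(keys.length) };

  for (let pass = 0; pass < 4; pass++) {
    const shift = pass * 8;
    // Histogram: count occurrences of each eight-bit digit.
    const hist = new Uint32Array(256);
    for (const k of src.keys) hist[(k >>> shift) & 0xff]++;
    // Prefix: an exclusive scan turns counts into global offsets.
    let sum = 0;
    for (let d = 0; d < 256; d++) { const c = hist[d]; hist[d] = sum; sum += c; }
    // Scatter: place each key/payload at its offset, then bump the offset.
    for (let i = 0; i < src.keys.length; i++) {
      const dest = hist[(src.keys[i] >>> shift) & 0xff]++;
      dst.keys[dest] = src.keys[i];
      dst.pay[dest] = src.pay[i];
    }
    [src, dst] = [dst, src]; // ping-pong, like key_a/key_b on the GPU
  }
  return src; // after an even number of passes, src holds the sorted data
}

const sorted = radixSortPairs(
  new Uint32Array([0xdeadbeef, 42, 7, 0xffffffff]),
  new Uint32Array([0, 1, 2, 3])
);
console.log(Array.from(sorted.keys)); // [7, 42, 3735928559, 4294967295]
```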

Key constants:

  • Eight-bit radix -> four passes for 32-bit keys.
  • Workgroup sizes: histogram and scatter = 256, prefix = 128.
  • RS_HISTOGRAM_BLOCK_ROWS = 15, so each workgroup touches 3840 keys.
  • All derived constants are baked into the WGSL in processShaderTemplate to keep padding and shared-memory sizes in sync with the CPU helpers.
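To illustrate the constant baking, here is a hedged sketch of the kind of substitution processShaderTemplate could perform. The real implementation lives in src/sort/radix_sort.ts; the placeholder syntax and constant names beyond those stated on this page are assumptions.

```typescript
// Hypothetical sketch of baking derived constants into the WGSL source.
// The real processShaderTemplate is in src/sort/radix_sort.ts; the {{NAME}}
// placeholder syntax below is an assumption for illustration.
const RS_RADIX_LOG2 = 8;                   // eight-bit digits
const RS_RADIX_SIZE = 1 << RS_RADIX_LOG2;  // 256 histogram bins
const RS_KEYVAL_SIZE = 32 / RS_RADIX_LOG2; // four passes for u32 keys
const RS_HISTOGRAM_BLOCK_ROWS = 15;
const HISTOGRAM_WG_SIZE = 256;

function processShaderTemplateSketch(src: string): string {
  const constants: Record<string, number> = {
    RS_RADIX_SIZE,
    RS_KEYVAL_SIZE,
    RS_HISTOGRAM_BLOCK_ROWS,
    HISTOGRAM_WG_SIZE,
  };
  // Replace each {{NAME}} placeholder with its numeric value so the WGSL
  // shared-memory sizes stay in sync with the CPU-side buffer math.
  return src.replace(/\{\{(\w+)\}\}/g, (_, name) => String(constants[name]));
}

const wgsl = processShaderTemplateSketch(
  "var<workgroup> histogram: array<atomic<u32>, {{RS_RADIX_SIZE}}>;"
);
console.log(wgsl);
// → var<workgroup> histogram: array<atomic<u32>, 256>;
```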

Integration Points

  • Preprocessing binds sorter_bg_pre as @group(2) to write depth keys, payload indices, keys_size, and the indirect dispatch counts. Keys are padded to the next multiple of 256 * 15 = 3840 so the histogram pass can iterate without bounds checks.
  • The renderer caches one PointCloudSortStuff per point cloud. prepareMulti resets the indirect buffer (recordResetIndirectBuffer), computes base offsets, runs preprocessing, then calls recordSortIndirect once using the dispatch data that preprocessing produced. The render pass binds sorter_render_bg to read sorted indices and instance counts.
  • Dynamic point counts: when ONNX pipelines vary the number of visible splats, preprocessing overwrites GeneralInfo.keys_size and sorter_dis.dispatch_x. The sorter's indirect path consumes those values directly, so no CPU readback is required.
  • Testing: GPURSSorter.create tries several subgroup sizes (16 -> 32 -> 16 -> 8 -> 1) and runs testSort (sorting 8192 floats) to confirm the compiled pipelines work on the current adapter before exposing the sorter to the renderer.
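The indirect dispatch count that preprocessing writes follows from keys_size alone, assuming one histogram/scatter workgroup per 3840 keys as described above. A sketch of that arithmetic (the helper name is ours):

```typescript
// Sketch of the indirect dispatch math implied above: one workgroup per
// 3840 keys (256 threads x 15 rows). On the GPU, preprocessing writes
// these values into sorter_dis directly, so the CPU never reads keys_size back.
const KEYS_PER_WORKGROUP = 256 * 15; // 3840

function indirectDispatchFor(keysSize: number) {
  return {
    dispatch_x: Math.ceil(keysSize / KEYS_PER_WORKGROUP),
    dispatch_y: 1,
    dispatch_z: 1,
  };
}

console.log(indirectDispatchFor(10_000));
// → { dispatch_x: 3, dispatch_y: 1, dispatch_z: 1 }
```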

Usage Example

import { GPURSSorter } from 'src/sort';

const sorter = await GPURSSorter.create(device, device.queue);
const sortStuff = sorter.createSortStuff(device, pointCloud.numPoints);
sortResourcesCache.set(pointCloud, sortStuff);

// Before preprocessing each frame
sorter.recordResetIndirectBuffer(sortStuff.sorter_dis, sortStuff.sorter_uni, device.queue);

// ... preprocessing writes depth keys and payloads into sortStuff ...

const encoder = device.createCommandEncoder();
sorter.recordSortIndirect(sortStuff, sortStuff.sorter_dis, encoder);
device.queue.submit([encoder.finish()]);

// Render pass
pass.setBindGroup(1, sortStuff.sorter_render_bg);
pass.drawIndirect(drawIndirectBuffer, 0);

Related Pages

  • Architecture – Workgroup coordination, LDS layout, and indirect dispatch sequencing.
  • API Reference – Exported sorter types, creation helpers, and command encoders.
  • Preprocess Module – Describes how depth keys and payloads are generated before sorting.
  • Renderer Module – Shows how sorted indirect buffers drive the draw passes.