Sorting Module
The sorting module hosts the GPU radix sorter that orders projected splats by depth before they are drawn. It owns the buffers, bind groups, and compute pipelines needed to transform the unsorted output from preprocessing into a contiguous list of payload indices that the renderer can feed to an indirect draw call. The TypeScript sources live under `src/sort/`, the WGSL shader lives under `src/shaders/`, and the resulting resources are shared between the preprocessor and the renderer.
Overview
| Item | Path | Purpose |
|---|---|---|
| Interfaces | `src/sort/index.ts` | `ISorter`, `SortedSplats`, `PointCloudSortStuff` contracts used across the renderer stack. |
| Implementation | `src/sort/radix_sort.ts` | `GPURSSorter` class plus buffer allocation helpers. |
| Shader | `src/shaders/radix_sort.wgsl` | Compute entry points for the four passes: `zero_histograms`, `calculate_histogram`, `prefix_histogram`, and the scatter pair `scatter_even` / `scatter_odd`. |
Responsibilities
- Allocate and recycle the ping-pong key and payload buffers used during radix passes.
- Expose bind group layouts so preprocessing (group 2) and rendering (group 1) can bind sorter resources without duplicating layouts.
- Record either fixed-size sorts (`recordSort`) or indirect sorts (`recordSortIndirect`) driven by the counters that preprocessing filled.
- Maintain the `GeneralInfo` uniform buffer (`keys_size`, `padded_size`, pass flags) and the indirect dispatch buffer shared between preprocessing, sorting, and rendering.
- Provide helper APIs (`createSortStuff`, `recordResetIndirectBuffer`) so the renderer can keep per-model caches in sync with the number of visible splats.
Data Flow
```text
Preprocessor (depth keys)
        |
        | writes key_a, payload_a, keys_size, dispatch_x
        v
PointCloudSortStuff (key_a, payload_a, sorter_uni, sorter_dis)
        |
        | consumed by GPURSSorter passes
        v
Zero -> Histogram -> Prefix -> Scatter
        |
        v
Renderer (binds sorter_render_bg, issues indirect draw)
```
Buffer and Uniform Snapshot
| Buffer | Source | Usage | Notes |
|---|---|---|---|
| `key_a` / `key_b` | Sorter | STORAGE | Ping-pong depth keys (u32). |
| `payload_a` / `payload_b` | Sorter | STORAGE | Ping-pong payload indices; `payload_a` doubles as the renderer's sorted index buffer. |
| `internal_mem` | Sorter | STORAGE | Histograms, partitions, and lookback metadata for WGSL. |
| `sorter_uni` | Sorter | STORAGE | `GeneralInfo` struct (`keys_size`, `padded_size`, pass selectors). Updated by renderer and preprocessing. |
| `sorter_dis` | Sorter | STORAGE / INDIRECT / COPY_DST | Stores `dispatch_x/y/z` for indirect histogram and scatter passes plus draw calls. |
| `sorter_bg_pre` | Sorter | Bind group | Exposes `sorter_uni`, `key_a`, `payload_a`, `sorter_dis` to preprocessing (`@group(2)`). |
| `sorter_render_bg` | Sorter | Bind group | Exposes `sorter_uni` and `payload_a` to the renderer (`@group(1)`). |
PointCloudSortStuff bundles the resources above so the renderer can keep a cache per point cloud or per global buffer.
GPU Pipeline (4 passes)
- Zero pass: clears histograms and shared memory.
- Histogram pass: each workgroup (256 threads x 15 rows) scans 3840 keys, extracts the 8-bit digits for all four radix passes, and atomically accumulates counts.
- Prefix pass: a 128-thread workgroup performs an exclusive scan over the 256-entry histograms to produce global offsets.
- Scatter passes: two entry points (`scatter_even`, `scatter_odd`) handle passes {0,2} and {1,3}, using ping-pong buffers to avoid hazards. Each thread computes a local rank, adds the global offset, and writes the key and payload to their new positions.
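The histogram, prefix, and scatter stages above can be mirrored on the CPU, which is handy for reasoning about the algorithm or checking GPU output. The following is a minimal single-threaded sketch, not the WGSL implementation: the function name `radixSortReference` and its signature are hypothetical, and it omits the workgroup partitioning and lookback logic that the shader uses.

```typescript
// CPU reference for an 8-bit LSD radix sort over u32 keys with payloads.
// Hypothetical helper for illustration; it is not part of src/sort.
const DIGIT_BITS = 8;
const BUCKETS = 1 << DIGIT_BITS; // 256 histogram entries per pass
const PASSES = 32 / DIGIT_BITS; // 4 passes for 32-bit keys

function radixSortReference(
  keys: Uint32Array,
  payloads: Uint32Array,
): { keys: Uint32Array; payloads: Uint32Array } {
  const n = keys.length;
  // Ping-pong pairs: each pass reads the "in" pair and writes the "out" pair.
  let keyIn = Uint32Array.from(keys);
  let payloadIn = Uint32Array.from(payloads);
  let keyOut = new Uint32Array(n);
  let payloadOut = new Uint32Array(n);

  for (let pass = 0; pass < PASSES; pass++) {
    const shift = pass * DIGIT_BITS;

    // Zero + histogram: count how many keys carry each digit value.
    const offsets = new Uint32Array(BUCKETS);
    for (let i = 0; i < n; i++) {
      offsets[(keyIn[i] >>> shift) & 0xff]++;
    }

    // Prefix: exclusive scan turns counts into starting offsets.
    let running = 0;
    for (let d = 0; d < BUCKETS; d++) {
      const count = offsets[d];
      offsets[d] = running;
      running += count;
    }

    // Scatter: stable placement of each key and payload at its offset.
    for (let i = 0; i < n; i++) {
      const digit = (keyIn[i] >>> shift) & 0xff;
      const dst = offsets[digit]++;
      keyOut[dst] = keyIn[i];
      payloadOut[dst] = payloadIn[i];
    }

    // Swap roles so the next pass reads what this pass wrote.
    [keyIn, keyOut] = [keyOut, keyIn];
    [payloadIn, payloadOut] = [payloadOut, payloadIn];
  }

  // After an even number of passes the result is back in the first pair.
  return { keys: keyIn, payloads: payloadIn };
}
```

Because the pass count is even, the sorted data ends up in the same buffer pair that received the input, which is consistent with `payload_a` serving as the renderer's sorted index buffer.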
Key constants:
- 8-bit radix, so four passes cover a 32-bit key.
- Workgroup sizes: histogram and scatter = 256, prefix = 128.
- `RS_HISTOGRAM_BLOCK_ROWS = 15`, so each workgroup touches 256 x 15 = 3840 keys.
- All derived constants are baked into the WGSL in `processShaderTemplate` to keep padding and shared memory sizes in sync with the CPU helpers.
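The arithmetic behind these constants can be sketched as follows. This is an illustrative helper, not code from `radix_sort.ts`: the name `sortSizing` and its return shape are assumptions, and it assumes that a key count already on a 3840 boundary needs no extra padding.

```typescript
// Sizing arithmetic derived from the constants listed above.
// `sortSizing` is a hypothetical helper for illustration only.
const HISTOGRAM_WG_SIZE = 256;
const RS_HISTOGRAM_BLOCK_ROWS = 15;
const KEYS_PER_WORKGROUP = HISTOGRAM_WG_SIZE * RS_HISTOGRAM_BLOCK_ROWS; // 3840
const KEY_BITS = 32;
const RADIX_LOG2 = 8;

interface SortSizing {
  paddedSize: number; // key count rounded up to a multiple of 3840
  workgroups: number; // histogram workgroups needed to cover the keys
  passes: number; // radix passes needed for a full key
}

function sortSizing(keysSize: number): SortSizing {
  const workgroups = Math.ceil(keysSize / KEYS_PER_WORKGROUP);
  return {
    paddedSize: workgroups * KEYS_PER_WORKGROUP,
    workgroups,
    passes: KEY_BITS / RADIX_LOG2,
  };
}

// Example: 10,000 visible splats need 3 workgroups and pad to 11,520 keys.
const sizing = sortSizing(10_000);
console.log(sizing.workgroups, sizing.paddedSize, sizing.passes);
```

On the indirect path the equivalent workgroup count is what ends up in `dispatch_x`, written by preprocessing on the GPU rather than computed on the CPU.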
Integration Points
- Preprocessing binds `sorter_bg_pre` as `@group(2)` to write depth keys, payload indices, `keys_size`, and indirect dispatch counts. Keys are padded to the next multiple of `256 * 15` so the histogram pass can iterate without bounds checks.
- Renderer caches one `PointCloudSortStuff` per point cloud. `prepareMulti` resets the indirect buffer (`recordResetIndirectBuffer`), computes base offsets, runs preprocessing, then calls `recordSortIndirect` once using the dispatch data that preprocessing produced. The render pass binds `sorter_render_bg` to read sorted indices and instance counts.
- Dynamic point counts: when ONNX pipelines vary the number of visible splats, preprocessing overwrites `GeneralInfo.keys_size` and `sorter_dis.dispatch_x`. The sorter indirect path consumes those values directly, so no CPU read back is required.
- Testing: `GPURSSorter.create` tries several subgroup sizes (16 -> 32 -> 16 -> 8 -> 1) and runs `testSort` (sorting 8192 floats) to ensure the compiled pipelines work on the current adapter before exposing the sorter to the renderer.
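The key buffers hold u32 values while depths (and the self-test inputs) are floats, so some order-preserving encoding is needed when arbitrary floats are sorted as unsigned integers. This document does not say which encoding the preprocessor uses; the sketch below shows one widely used mapping as an assumption, with `depthToSortableKey` being a hypothetical name. Non-negative floats already compare correctly as raw bit patterns, and the sign handling extends that to negative values.

```typescript
// Map a 32-bit float to a u32 whose unsigned order matches the float order.
// Assumption: this is a common encoding, not necessarily the one used by
// the preprocessing shader. NaN inputs are not handled.
function depthToSortableKey(depth: number): number {
  const view = new DataView(new ArrayBuffer(4));
  view.setFloat32(0, depth);
  const bits = view.getUint32(0);
  const isNegative = (bits & 0x80000000) !== 0;
  // Negative floats: flip every bit so larger magnitudes sort first.
  // Non-negative floats: set the sign bit so they sort after all negatives.
  return (isNegative ? ~bits : bits | 0x80000000) >>> 0;
}

// Keys produced from ascending depths are themselves ascending.
const sampleDepths = [-10, -1.5, 0, 0.25, 1, 1e6];
const sampleKeys = sampleDepths.map(depthToSortableKey);
console.log(sampleKeys.every((k, i) => i === 0 || sampleKeys[i - 1] < k));
```

If every depth reaching the sorter is known to be non-negative, the raw bit pattern can be used as the key directly and the sign handling becomes unnecessary.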
Usage Example
```ts
import { GPURSSorter } from 'src/sort';

// `device`, `pointCloud`, `sortResourcesCache`, `pass`, and `drawIndirectBuffer`
// are provided by the surrounding renderer code.
const sorter = await GPURSSorter.create(device, device.queue);
const sortStuff = sorter.createSortStuff(device, pointCloud.numPoints);
sortResourcesCache.set(pointCloud, sortStuff);

// Before preprocessing each frame
sorter.recordResetIndirectBuffer(sortStuff.sorter_dis, sortStuff.sorter_uni, device.queue);

// ... preprocessing writes depth keys and payloads into sortStuff ...

const encoder = device.createCommandEncoder();
sorter.recordSortIndirect(sortStuff, sortStuff.sorter_dis, encoder);
device.queue.submit([encoder.finish()]);

// Render pass
pass.setBindGroup(1, sortStuff.sorter_render_bg);
pass.drawIndirect(drawIndirectBuffer, 0);
```
Related Docs
- Architecture – Workgroup coordination, LDS layout, and indirect dispatch sequencing.
- API Reference – Exported sorter types, creation helpers, and command encoders.
- Preprocess Module – Describes how depth keys and payloads are generated before sorting.
- Renderer Module – Shows how sorted indirect buffers drive the draw passes.