Sorting Module

The sorting module hosts the GPU radix sorter that orders projected splats by depth before they are drawn. It owns the buffers, bind groups, and compute pipelines needed to transform the unsorted output from preprocessing into a contiguous list of payload indices that the renderer can feed to an indirect draw call. Everything lives under src/sort/ and is shared between the preprocessor and the renderer.

Overview

| Item | Path | Purpose |
| --- | --- | --- |
| Interfaces | src/sort/index.ts | ISorter, SortedSplats, and PointCloudSortStuff contracts used across the renderer stack. |
| Implementation | src/sort/radix_sort.ts | GPURSSorter class plus buffer allocation helpers. |
| Shader | src/shaders/radix_sort.wgsl | Compute entry points for the four stages (zero_histograms, calculate_histogram, prefix_histogram, and the scatter_even / scatter_odd pair). |

Responsibilities

  • Allocate and recycle the ping-pong key and payload buffers used during radix passes.
  • Expose bind group layouts so preprocessing (group 2) and rendering (group 1) can bind sorter resources without duplicating layouts.
  • Record either fixed-size sorts (recordSort) or indirect sorts driven by the counters that preprocessing filled.
  • Maintain the GeneralInfo uniform buffer (keys_size, padded_size, pass flags) and the indirect dispatch buffer shared between preprocessing, sorting, and rendering.
  • Provide helper APIs (createSortStuff, recordResetIndirectBuffer) so the renderer can keep per-model caches in sync with the number of visible splats.
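The responsibilities above imply a fairly small sorter surface. The sketch below is a hedged approximation of that contract; the real signatures live in src/sort/index.ts, and any field or parameter name not mentioned on this page is an assumption for illustration.

```typescript
// Hypothetical sketch of the sorter contract; real signatures live in
// src/sort/index.ts. Placeholder `object` types stand in for GPU resources.
interface PointCloudSortStuffSketch {
  numPoints: number;   // visible splat count this bundle was sized for (name assumed)
  sorterUni: object;   // stands in for the GeneralInfo GPUBuffer
  sorterDis: object;   // stands in for the indirect dispatch GPUBuffer
}

interface ISorterSketch {
  // Allocate the per-point-cloud buffers and bind groups.
  createSortStuff(device: object, numPoints: number): PointCloudSortStuffSketch;
  // Reset keys_size and the dispatch counters before preprocessing runs.
  recordResetIndirectBuffer(dis: object, uni: object, queue: object): void;
  // Record the zero/histogram/prefix/scatter passes with indirect dispatch.
  recordSortIndirect(stuff: PointCloudSortStuffSketch, dis: object, encoder: object): void;
}

// A no-op stand-in showing the shape a renderer would hold onto.
const sorterStub: ISorterSketch = {
  createSortStuff: (_device, numPoints) => ({ numPoints, sorterUni: {}, sorterDis: {} }),
  recordResetIndirectBuffer: () => {},
  recordSortIndirect: () => {},
};

const stuff = sorterStub.createSortStuff({}, 1_000_000);
console.log(stuff.numPoints); // 1000000
```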

Data Flow

Preprocessor (depth keys)
        |
        | writes key_a, payload_a, keys_size, dispatch_x
        v
PointCloudSortStuff (key_a, payload_a, sorter_uni, sorter_dis)
        |
        | consumed by GPURSSorter passes
        v
Zero -> Histogram -> Prefix -> Scatter
        |
        v
Renderer (binds sorter_render_bg, issues indirect draw)

Buffer and Uniform Snapshot

| Buffer | Source | Usage | Notes |
| --- | --- | --- | --- |
| key_a / key_b | Sorter | STORAGE | Ping-pong depth keys (u32). |
| payload_a / payload_b | Sorter | STORAGE | Ping-pong payload indices; payload_a doubles as the renderer's sorted index buffer. |
| internal_mem | Sorter | STORAGE | Histograms, partitions, and lookback metadata for the WGSL passes. |
| sorter_uni | Sorter | STORAGE | GeneralInfo struct (keys_size, padded_size, pass selectors). Updated by the renderer and preprocessing. |
| sorter_dis | Sorter | STORAGE / INDIRECT / COPY_DST | Stores dispatch_x/y/z for the indirect histogram and scatter passes plus draw calls. |
| sorter_bg_pre | Sorter | Bind group | Exposes sorter_uni, key_a, payload_a, and sorter_dis to preprocessing (@group(2)). |
| sorter_render_bg | Sorter | Bind group | Exposes sorter_uni and payload_a to the renderer (@group(1)). |

PointCloudSortStuff bundles the resources above so the renderer can keep a cache per point cloud or per global buffer.
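The buffer sizes follow directly from the constants on this page: u32 keys and payloads, padded to a whole number of histogram blocks (256 threads × 15 rows = 3840 keys). A hedged sketch of that math, with a helper name of our own choosing:

```typescript
// Sketch of how the ping-pong buffers might be sized, assuming the constants
// described on this page: u32 entries and 3840 keys per histogram workgroup.
// The helper name is ours, not part of the real API.
const HISTOGRAM_WG_SIZE = 256;
const RS_HISTOGRAM_BLOCK_ROWS = 15;
const KEYS_PER_WORKGROUP = HISTOGRAM_WG_SIZE * RS_HISTOGRAM_BLOCK_ROWS; // 3840

function estimateSortBufferBytes(numPoints: number) {
  // Pad to the next multiple of 3840 so the histogram pass can iterate
  // without bounds checks.
  const paddedSize = Math.ceil(numPoints / KEYS_PER_WORKGROUP) * KEYS_PER_WORKGROUP;
  const bytesPerEntry = 4; // u32 depth key or payload index
  return {
    paddedSize,
    keyBufferBytes: paddedSize * bytesPerEntry,     // key_a (and key_b)
    payloadBufferBytes: paddedSize * bytesPerEntry, // payload_a (and payload_b)
  };
}

console.log(estimateSortBufferBytes(1000));
// → { paddedSize: 3840, keyBufferBytes: 15360, payloadBufferBytes: 15360 }
```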

GPU Pipeline (4 passes)

  1. Zero pass: clears histograms and shared memory.
  2. Histogram pass: each workgroup (256 threads × 15 rows) scans 3840 keys, extracts the eight-bit digit for each of the four radix passes, and atomically accumulates counts.
  3. Prefix pass: a 128-thread workgroup performs an exclusive scan over the 256-entry histograms to produce global offsets.
  4. Scatter passes: two entry points (scatter_even, scatter_odd) handle passes {0, 2} and {1, 3}, using ping-pong buffers to avoid hazards. Each thread computes a local rank, adds the global offset, and writes the key and payload to their new positions.
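The four passes can be followed on the CPU. The reference below is a serial, illustrative sketch of the same four-pass, eight-bit LSD radix sort over u32 key/payload pairs; it mirrors the histogram → prefix → scatter structure and the ping-pong buffering, but none of the GPU parallelism.

```typescript
// CPU reference for the GPU passes: four LSD radix passes over 32-bit keys,
// eight bits per pass, ping-ponging between (keys, payloads) pairs.
// Serial and illustrative only; the GPU performs these steps as histogram,
// prefix-scan, and scatter dispatches.
function radixSortPairs(keys: Uint32Array, payloads: Uint32Array) {
  let src = { keys: keys.slice(), pay: payloads.slice() };
  let dst = { keys: new Uint32Array(keys.length), pay: new Uint32Array(keys.length) };

  for (let pass = 0; pass < 4; pass++) {
    const shift = pass * 8;
    // Histogram: count occurrences of each eight-bit digit.
    const hist = new Uint32Array(256);
    for (const k of src.keys) hist[(k >>> shift) & 0xff]++;
    // Prefix: an exclusive scan turns counts into global offsets.
    let sum = 0;
    for (let d = 0; d < 256; d++) { const c = hist[d]; hist[d] = sum; sum += c; }
    // Scatter: place each key/payload at its offset, then bump the offset.
    for (let i = 0; i < src.keys.length; i++) {
      const dest = hist[(src.keys[i] >>> shift) & 0xff]++;
      dst.keys[dest] = src.keys[i];
      dst.pay[dest] = src.pay[i];
    }
    [src, dst] = [dst, src]; // ping-pong, like key_a/key_b on the GPU
  }
  return src; // after an even number of passes, src holds the sorted data
}

const sorted = radixSortPairs(
  new Uint32Array([0xdeadbeef, 42, 7, 0xffffffff]),
  new Uint32Array([0, 1, 2, 3])
);
console.log(Array.from(sorted.keys)); // [7, 42, 3735928559, 4294967295]
```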

Key constants:

  • Eight-bit radix -> four passes for 32-bit keys.
  • Workgroup sizes: histogram and scatter = 256, prefix = 128.
  • RS_HISTOGRAM_BLOCK_ROWS = 15, so each workgroup touches 3840 keys.
  • All derived constants are baked into the WGSL in processShaderTemplate to keep padding and shared-memory sizes in sync with the CPU helpers.
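To illustrate the constant baking, here is a hedged sketch of the kind of substitution processShaderTemplate could perform. The real implementation lives in src/sort/radix_sort.ts; the placeholder syntax and constant names beyond those stated on this page are assumptions.

```typescript
// Hypothetical sketch of baking derived constants into the WGSL source.
// The real processShaderTemplate is in src/sort/radix_sort.ts; the {{NAME}}
// placeholder syntax below is an assumption for illustration.
const RS_RADIX_LOG2 = 8;                   // eight-bit digits
const RS_RADIX_SIZE = 1 << RS_RADIX_LOG2;  // 256 histogram bins
const RS_KEYVAL_SIZE = 32 / RS_RADIX_LOG2; // four passes for u32 keys
const RS_HISTOGRAM_BLOCK_ROWS = 15;
const HISTOGRAM_WG_SIZE = 256;

function processShaderTemplateSketch(src: string): string {
  const constants: Record<string, number> = {
    RS_RADIX_SIZE,
    RS_KEYVAL_SIZE,
    RS_HISTOGRAM_BLOCK_ROWS,
    HISTOGRAM_WG_SIZE,
  };
  // Replace each {{NAME}} placeholder with its numeric value so the WGSL
  // shared-memory sizes stay in sync with the CPU-side buffer math.
  return src.replace(/\{\{(\w+)\}\}/g, (_, name) => String(constants[name]));
}

const wgsl = processShaderTemplateSketch(
  "var<workgroup> histogram: array<atomic<u32>, {{RS_RADIX_SIZE}}>;"
);
console.log(wgsl);
// → var<workgroup> histogram: array<atomic<u32>, 256>;
```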

Integration Points

  • Preprocessing binds sorter_bg_pre as @group(2) to write depth keys, payload indices, keys_size, and the indirect dispatch counts. Keys are padded to the next multiple of 256 * 15 = 3840 so the histogram pass can iterate without bounds checks.
  • The renderer caches one PointCloudSortStuff per point cloud. prepareMulti resets the indirect buffer (recordResetIndirectBuffer), computes base offsets, runs preprocessing, then calls recordSortIndirect once using the dispatch data that preprocessing produced. The render pass binds sorter_render_bg to read sorted indices and instance counts.
  • Dynamic point counts: when ONNX pipelines vary the number of visible splats, preprocessing overwrites GeneralInfo.keys_size and sorter_dis.dispatch_x. The sorter's indirect path consumes those values directly, so no CPU readback is required.
  • Testing: GPURSSorter.create tries several subgroup sizes (16 -> 32 -> 16 -> 8 -> 1) and runs testSort (sorting 8192 floats) to confirm the compiled pipelines work on the current adapter before exposing the sorter to the renderer.
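The indirect dispatch count that preprocessing writes follows from keys_size alone, assuming one histogram/scatter workgroup per 3840 keys as described above. A sketch of that arithmetic (the helper name is ours):

```typescript
// Sketch of the indirect dispatch math implied above: one workgroup per
// 3840 keys (256 threads x 15 rows). On the GPU, preprocessing writes
// these values into sorter_dis directly, so the CPU never reads keys_size back.
const KEYS_PER_WORKGROUP = 256 * 15; // 3840

function indirectDispatchFor(keysSize: number) {
  return {
    dispatch_x: Math.ceil(keysSize / KEYS_PER_WORKGROUP),
    dispatch_y: 1,
    dispatch_z: 1,
  };
}

console.log(indirectDispatchFor(10_000));
// → { dispatch_x: 3, dispatch_y: 1, dispatch_z: 1 }
```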

Usage Example

import { GPURSSorter } from 'src/sort';

const sorter = await GPURSSorter.create(device, device.queue);
const sortStuff = sorter.createSortStuff(device, pointCloud.numPoints);
sortResourcesCache.set(pointCloud, sortStuff);

// Before preprocessing each frame
sorter.recordResetIndirectBuffer(sortStuff.sorter_dis, sortStuff.sorter_uni, device.queue);

// ... preprocessing writes depth keys and payloads into sortStuff ...

const encoder = device.createCommandEncoder();
sorter.recordSortIndirect(sortStuff, sortStuff.sorter_dis, encoder);
device.queue.submit([encoder.finish()]);

// Render pass
pass.setBindGroup(1, sortStuff.sorter_render_bg);
pass.drawIndirect(drawIndirectBuffer, 0);

Related Pages

  • Architecture – Workgroup coordination, LDS layout, and indirect dispatch sequencing.
  • API Reference – Exported sorter types, creation helpers, and command encoders.
  • Preprocess Module – Describes how depth keys and payloads are generated before sorting.
  • Renderer Module – Shows how sorted indirect buffers drive the draw passes.