Sorting Module Architecture

GPURSSorter implements a four-pass GPU radix sort that reorders depth keys and payload indices entirely on the GPU. This document focuses on how the TypeScript wrapper, the buffer layout, and the WGSL shaders fit together inside the multi-model rendering path.

High-level pipeline

preprocess.wgsl  -- writes depth keys + payload indices + counters --> sorter_bg_pre
radix sort pass 0 -- zero_histograms      (indirect or fixed dispatch)
radix sort pass 1 -- calculate_histogram
radix sort pass 2 -- prefix_histogram
radix sort pass 3 -- scatter_even and scatter_odd (ping pong buffers)
renderer          -- reads payload_a through sorter_render_bg -> indirect draw

Only one sorter instance is needed per GPUDevice. The renderer caches one PointCloudSortStuff per point cloud (or per global buffer) so buffers can be reused between frames.

Bind group layouts

GPURSSorter creates three layouts:

  1. Sort pipeline layout (private): a bind group with six storage buffers.
     • binding 0: sorter_uni (GeneralInfo)
     • binding 1: internal_mem (histograms + partitions)
     • bindings 2/3: key_a / key_b
     • bindings 4/5: payload_a / payload_b
  2. Render layout (createRenderBindGroupLayout); the renderer binds this as @group(1) for the draw pass (see the sketch after this list).
     • binding 0: sorter_uni (read-only)
     • binding 4: payload_a (the buffer that holds the final sorted order)
  3. Preprocess layout (createPreprocessBindGroupLayout); preprocessing binds this as @group(2) so it can write keys, payloads, and atomic counters directly.
     • binding 0: sorter_uni
     • binding 1: key_a
     • binding 2: payload_a
     • binding 3: sorter_dis (indirect dispatch buffer)
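
For concreteness, here is a minimal sketch of how the render layout could be declared through the WebGPU API. The label and shader-stage visibility are assumptions; the binding numbers and read-only storage types follow the description above.

    function createRenderBindGroupLayout(device: GPUDevice): GPUBindGroupLayout {
        return device.createBindGroupLayout({
            label: 'sorter render layout', // label is illustrative
            entries: [
                // binding 0: sorter_uni, read by the vertex stage for instance counts
                { binding: 0, visibility: GPUShaderStage.VERTEX, buffer: { type: 'read-only-storage' } },
                // binding 4: payload_a, the final sorted index buffer
                { binding: 4, visibility: GPUShaderStage.VERTEX, buffer: { type: 'read-only-storage' } },
            ],
        });
    }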

Buffer layout

  • key_a, key_b: Ping-pong key buffers (32-bit float depth keys reinterpreted as u32). The scatter passes alternate reading and writing between them on each radix pass.
  • payload_a, payload_b: Ping-pong payload buffers. payload_a ends up holding the final sorted index buffer bound by the renderer.
  • internal_mem: Scratch buffer containing the histograms for all passes, partition metadata, and workgroup lookback data. Its size grows with the padded key count.
  • sorter_uni: GeneralInfo struct (5 x u32) packed as { keys_size, padded_size, passes, even_pass, odd_pass }. Preprocessing updates keys_size every frame.
  • sorter_dis: IndirectDispatch struct (dispatch_x, dispatch_y, dispatch_z). Preprocessing stores how many histogram/scatter workgroups are needed so the sorter can use dispatchWorkgroupsIndirect.
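
As a sketch of the two small structs, assuming tightly packed u32 fields in the order listed above (the helper names here are hypothetical):

    // Hypothetical helpers mirroring the struct layouts described above.
    function writeGeneralInfo(queue: GPUQueue, sorterUni: GPUBuffer,
                              keysSize: number, paddedSize: number): void {
        // { keys_size, padded_size, passes, even_pass, odd_pass } = 5 x u32 (20 bytes);
        // passes = 4 for 32-bit keys with 8-bit digits; pass indices assumed to start at 0.
        queue.writeBuffer(sorterUni, 0, new Uint32Array([keysSize, paddedSize, 4, 0, 0]));
    }

    function writeIndirectDispatch(queue: GPUQueue, sorterDis: GPUBuffer, workgroups: number): void {
        // { dispatch_x, dispatch_y, dispatch_z }, consumed by dispatchWorkgroupsIndirect.
        queue.writeBuffer(sorterDis, 0, new Uint32Array([workgroups, 1, 1]));
    }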

Padding

Keys are padded to keys_per_workgroup = HISTOGRAM_WG_SIZE * RS_HISTOGRAM_BLOCK_ROWS = 256 * 15 = 3840. createKeyvalBuffers rounds the requested point count up to this multiple when allocating buffers, and GeneralInfo.padded_size stores the padded value. Preprocessing is responsible for zero-filling the extra slots when fewer splats are visible.
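
The rounding itself is straightforward; as a sketch:

    const HISTOGRAM_WG_SIZE = 256;
    const RS_HISTOGRAM_BLOCK_ROWS = 15;
    const KEYS_PER_WORKGROUP = HISTOGRAM_WG_SIZE * RS_HISTOGRAM_BLOCK_ROWS; // 3840

    // Round the requested point count up to a whole number of workgroups,
    // as createKeyvalBuffers does when sizing the key/payload buffers.
    function paddedKeyCount(numPoints: number): number {
        return Math.ceil(numPoints / KEYS_PER_WORKGROUP) * KEYS_PER_WORKGROUP;
    }

    paddedKeyCount(100_000); // 27 workgroups -> 103_680 padded keys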

Compute passes

1. zero_histograms

  • Workgroup size: 256
  • Clears histograms and partition metadata in internal_mem.
  • When indirect mode is used, dispatch count is read from sorter_dis.dispatch_x (which preprocessing set while emitting splats).
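
Recording this pass in indirect mode could look roughly like the following; the pipeline and bind-group plumbing are assumed, but dispatchWorkgroupsIndirect reading sorter_dis at offset 0 is the mechanism described above.

    function recordZeroHistograms(encoder: GPUCommandEncoder, pipeline: GPUComputePipeline,
                                  sortBindGroup: GPUBindGroup, sorterDis: GPUBuffer): void {
        const pass = encoder.beginComputePass({ label: 'zero_histograms' });
        pass.setPipeline(pipeline);
        pass.setBindGroup(0, sortBindGroup);
        // Workgroup count comes from sorter_dis.dispatch_x, filled in by preprocessing.
        pass.dispatchWorkgroupsIndirect(sorterDis, 0);
        pass.end();
    }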

2. calculate_histogram

  • Workgroup size: 256, with 15 keys per thread (RS_HISTOGRAM_BLOCK_ROWS), i.e. 3840 keys per workgroup.
  • Each thread loads multiple depth keys, extracts the digit for all four radix passes, and accumulates counts in shared memory before atomically adding to global histograms.
  • Constants such as histogram_sg_size and rs_histogram_block_rows are injected by processShaderTemplate so WGSL and TypeScript derive the same memory sizes.
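
A minimal sketch of such template processing; the {{name}} placeholder syntax is an assumption, not necessarily what processShaderTemplate actually uses.

    function processShaderTemplate(src: string, consts: Record<string, number>): string {
        return src.replace(/{{(\w+)}}/g, (_m, name: string) => {
            const value = consts[name];
            if (value === undefined) throw new Error(`unknown shader constant: ${name}`);
            return String(value);
        });
    }

    const histogramSrc = 'const rs_histogram_block_rows = {{rs_histogram_block_rows}}u;';
    const expanded = processShaderTemplate(histogramSrc, { rs_histogram_block_rows: 15 });
    // expanded === 'const rs_histogram_block_rows = 15u;'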

3. prefix_histogram

  • Workgroup size: 128.
  • Implements an exclusive scan over each 256-entry histogram in shared memory (up-sweep + down-sweep).
  • Results are written back to the histogram portion of internal_mem so the scatter pass can look up global offsets per digit/per pass.
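
The up-sweep/down-sweep structure is the classic Blelloch scan. For intuition, a CPU reference of the same exclusive scan over one 256-entry histogram:

    // Turns per-digit counts into exclusive prefix sums (starting offsets per digit).
    function exclusiveScan256(hist: Uint32Array): void {
        const n = hist.length; // 256, a power of two
        // Up-sweep (reduce): build partial sums in place.
        for (let stride = 1; stride < n; stride *= 2) {
            for (let i = stride * 2 - 1; i < n; i += stride * 2) {
                hist[i] += hist[i - stride];
            }
        }
        // Down-sweep: convert the partial sums into exclusive prefix sums.
        hist[n - 1] = 0;
        for (let stride = n / 2; stride >= 1; stride /= 2) {
            for (let i = stride * 2 - 1; i < n; i += stride * 2) {
                const t = hist[i - stride];
                hist[i - stride] = hist[i];
                hist[i] += t;
            }
        }
    }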

4. scatter_even and scatter_odd

  • Workgroup size: 256.
  • scatter_even handles passes 0 and 2 (reading from key_a, writing to key_b), scatter_odd handles passes 1 and 3 (reading from key_b, writing to key_a). The same logic applies to payload buffers.
  • Each thread computes a local rank for its digit, adds the global offset from the histogram prefix, and writes the key/payload to the computed destination index.
  • At the end of the fourth pass, key_a and payload_a contain the final sorted order.
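
A CPU reference of the whole four-pass ping-pong schedule (8-bit digits, least significant first) may help; here keys/payload play the role of key_a/payload_a and the scratch arrays play key_b/payload_b.

    function radixSortRef(keys: Uint32Array, payload: Uint32Array): void {
        let srcK = keys, srcP = payload;
        let dstK = new Uint32Array(keys.length), dstP = new Uint32Array(payload.length);
        for (let pass = 0; pass < 4; pass++) {
            const shift = pass * 8;
            const hist = new Uint32Array(256);
            for (const k of srcK) hist[(k >>> shift) & 0xff]++;
            // Exclusive prefix sum gives the global starting offset per digit.
            let sum = 0;
            for (let d = 0; d < 256; d++) { const c = hist[d]; hist[d] = sum; sum += c; }
            for (let i = 0; i < srcK.length; i++) {
                const dst = hist[(srcK[i] >>> shift) & 0xff]++;
                dstK[dst] = srcK[i];
                dstP[dst] = srcP[i];
            }
            // Ping-pong: even passes write a -> b, odd passes write b -> a.
            [srcK, dstK] = [dstK, srcK];
            [srcP, dstP] = [dstP, srcP];
        }
        // After four swaps srcK/srcP alias the original arrays again, mirroring
        // how the GPU version finishes with sorted data in key_a / payload_a.
    }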

Indirect vs fixed dispatch

  • recordSort (fixed) calls zero -> histogram -> prefix -> scatter with explicit workgroup counts derived from numPoints.
  • recordSortIndirect builds each compute pass using dispatchWorkgroupsIndirect and the counts stored in sorter_dis. This is the path used by the renderer because preprocessing computes the actual number of visible splats every frame.
  • recordResetIndirectBuffer resets both dispatch_x and keys_size to zero so preprocessing can increment them atomically.
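
The reset can be a pair of clearBuffer calls, assuming dispatch_x and keys_size each sit at byte offset 0 of their buffers (which matches the field orders listed earlier):

    function recordResetIndirectBuffer(encoder: GPUCommandEncoder,
                                       sorterDis: GPUBuffer, sorterUni: GPUBuffer): void {
        encoder.clearBuffer(sorterDis, 0, 4); // dispatch_x = 0
        encoder.clearBuffer(sorterUni, 0, 4); // keys_size = 0
    }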

Integration with preprocessing

Preprocessing binds sorter_bg_pre and performs three critical tasks:

  1. Writes depth keys (packed floats) into key_a and payload indices into payload_a at the same offsets it writes the 2D splats.
  2. Uses atomics on sorter_uni.keys_size to track the number of visible splats and on sorter_dis.dispatch_x to track how many histogram workgroups are required (one increment per 256 * 15 splats; see the sketch after this list).
  3. Pads any unused slots up to padded_size with zeros so the scatter pass stays within bounds.
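
A CPU mirror of that counter pattern (the names here are hypothetical; on the GPU both increments are atomicAdd calls in WGSL):

    const KEYS_PER_WORKGROUP = 256 * 15; // 3840

    function claimSlot(counters: { keysSize: number; dispatchX: number }): number {
        const slot = counters.keysSize++;      // atomicAdd(&sorter_uni.keys_size, 1u)
        if (slot % KEYS_PER_WORKGROUP === 0) {
            counters.dispatchX++;              // first splat of a new block bumps dispatch_x
        }
        return slot;                           // where this splat's key/payload are written
    }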

Once preprocessing finishes, the renderer calls recordSortIndirect with the same PointCloudSortStuff. No CPU read-back is needed, because both the key count and the dispatch counts are already waiting inside sorter_uni and sorter_dis.

Integration with the renderer

  • PointCloudSortStuff.sorter_render_bg is bound at @group(1) in the render pipeline, providing sorter_uni (for instance counts) and payload_a (sorted indices).
  • After sorting, the renderer copies sorter_uni.keys_size into the draw indirect buffer so the splat pipeline knows how many instances to draw (sketched after this list).
  • In the multi-model path, a single recordSortIndirect and a single draw call cover every model, because preprocessing wrote each model into a unique range of the global splat and payload buffers.
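
The copy itself is a four-byte copyBufferToBuffer. Assuming the standard WebGPU indirect-draw layout [vertexCount, instanceCount, firstVertex, firstInstance], keys_size (the first u32 of sorter_uni) lands in instanceCount at byte offset 4:

    function copyInstanceCount(encoder: GPUCommandEncoder,
                               sorterUni: GPUBuffer, drawIndirect: GPUBuffer): void {
        // sorter_uni.keys_size (offset 0) -> drawIndirect.instanceCount (offset 4)
        encoder.copyBufferToBuffer(sorterUni, 0, drawIndirect, 4, 4);
    }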

Initialization and validation

GPURSSorter.create(device, queue) is async because it:

  1. Iterates through candidate subgroup sizes (16, 32, 8, 1) and builds pipelines for each trial (see the sketch after this list).
  2. Calls testSort (sorting 8192 floats) to ensure the chosen configuration works on the current adapter.
  3. Returns a sorter instance only when a configuration succeeds; otherwise it throws.
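
In outline, the validation loop looks something like this; buildPipelines stands in for whatever the real pipeline construction is called, and only GPURSSorter.create and testSort appear in this document.

    declare class GPURSSorter {}
    declare function buildPipelines(device: GPUDevice, subgroupSize: number): GPURSSorter;
    declare function testSort(sorter: GPURSSorter, device: GPUDevice, queue: GPUQueue): Promise<boolean>;

    async function create(device: GPUDevice, queue: GPUQueue): Promise<GPURSSorter> {
        for (const subgroupSize of [16, 32, 8, 1]) {
            const candidate = buildPipelines(device, subgroupSize);
            // testSort sorts 8192 floats and checks the result, per the description above.
            if (await testSort(candidate, device, queue)) return candidate;
        }
        throw new Error('radix sort self-test failed for every candidate subgroup size');
    }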

Debugging tips

  • Verify sorter_uni.keys_size after preprocessing to confirm the atomic counters match the expected visible splat count.
  • Ensure dispatch_x equals ceil(keys_size / (256 * 15)) before calling recordSortIndirect; the sketch after this list automates that check.
  • Remember that after four passes the sorted data resides in payload_a. If the renderer sees stale results, the usual culprits are a missing ping-pong flip or preprocessing writing into the wrong buffer.
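
One way to run both counter checks from the CPU while debugging is a throwaway staging buffer plus mapAsync; the field offsets follow the layouts above.

    async function checkSortCounters(device: GPUDevice,
                                     sorterUni: GPUBuffer, sorterDis: GPUBuffer): Promise<void> {
        const staging = device.createBuffer({
            size: 8,
            usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
        });
        const encoder = device.createCommandEncoder();
        encoder.copyBufferToBuffer(sorterUni, 0, staging, 0, 4); // keys_size
        encoder.copyBufferToBuffer(sorterDis, 0, staging, 4, 4); // dispatch_x
        device.queue.submit([encoder.finish()]);
        await staging.mapAsync(GPUMapMode.READ);
        const [keysSize, dispatchX] = new Uint32Array(staging.getMappedRange());
        staging.unmap();
        console.assert(dispatchX === Math.ceil(keysSize / 3840),
            `dispatch_x ${dispatchX} != ceil(${keysSize} / 3840)`);
    }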