# Sorting Module Architecture
`GPURSSorter` implements a four-pass GPU radix sort that reorders depth keys and payload indices entirely on the GPU. This document describes how the TypeScript wrapper, the buffer layout, and the WGSL shaders fit together inside the multi-model rendering path.
## High-level pipeline

```
preprocess.wgsl   -- writes depth keys + payload indices + counters --> sorter_bg_pre
radix sort pass 0 -- zero_histograms (indirect or fixed dispatch)
radix sort pass 1 -- calculate_histogram
radix sort pass 2 -- prefix_histogram
radix sort pass 3 -- scatter_even / scatter_odd (ping-pong buffers)
renderer          -- reads payload_a through sorter_render_bg --> indirect draw
```
Only one sorter instance is needed per `GPUDevice`. The renderer caches one `PointCloudSortStuff` per point cloud (or per global buffer) so buffers can be reused between frames.
## Bind group layouts

`GPURSSorter` creates three layouts:

- Sort pipeline layout (private): bind group with six storage buffers
  - binding 0: `sorter_uni` (`GeneralInfo`)
  - binding 1: `internal_mem` (histograms + partitions)
  - bindings 2/3: `key_a` / `key_b`
  - bindings 4/5: `payload_a` / `payload_b`
- Render layout (`createRenderBindGroupLayout`), sketched below
  - binding 0: `sorter_uni` (read-only)
  - binding 4: `payload_a` (the buffer that holds the final order)

  The renderer binds this as `@group(1)` for the draw pass.
- Preprocess layout (`createPreprocessBindGroupLayout`)
  - binding 0: `sorter_uni`
  - binding 1: `key_a`
  - binding 2: `payload_a`
  - binding 3: `sorter_dis` (indirect dispatch buffer)

  Preprocessing binds this as `@group(2)` so it can write keys, payloads, and atomic counters directly.
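As a concrete reference, the render layout might be declared as below. This is a sketch: the label and visibility flags are assumptions, while the binding numbers and buffer types follow the list above.

```ts
// Sketch of createRenderBindGroupLayout; label and visibility are assumed.
function createRenderBindGroupLayout(device: GPUDevice): GPUBindGroupLayout {
  return device.createBindGroupLayout({
    label: "sorter render layout",
    entries: [
      {
        binding: 0, // sorter_uni: GeneralInfo, read-only during the draw
        visibility: GPUShaderStage.VERTEX,
        buffer: { type: "read-only-storage" },
      },
      {
        binding: 4, // payload_a: the final sorted index buffer
        visibility: GPUShaderStage.VERTEX,
        buffer: { type: "read-only-storage" },
      },
    ],
  });
}
```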
## Buffer layout

| Name | Description |
|---|---|
| `key_a`, `key_b` | Ping-pong key buffers (32-bit floats reinterpreted as `u32`). The scatter pass alternates reads and writes per radix pass. |
| `payload_a`, `payload_b` | Ping-pong payload buffers. `payload_a` becomes the final sorted index buffer bound by the renderer. |
| `internal_mem` | Scratch buffer containing histograms for all passes, shared-memory partitions, and workgroup lookback data. Its size grows with the padded key count. |
| `sorter_uni` | `GeneralInfo` struct (5 × u32) packed as `{ keys_size, padded_size, passes, even_pass, odd_pass }`. Preprocessing updates `keys_size` every frame. |
| `sorter_dis` | `IndirectDispatch` struct (`dispatch_x`, `dispatch_y`, `dispatch_z`). Preprocessing stores how many histogram/scatter workgroups are needed so the sorter can use `dispatchWorkgroupsIndirect`. |
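Taken together, these buffers plus the bind groups from the previous section form the per-point-cloud bundle. The interface below is an assumed shape: the buffer roles follow the table, but the exact fields of `PointCloudSortStuff` are a sketch.

```ts
// Assumed shape of the per-point-cloud sort resources.
interface PointCloudSortStuff {
  num_points: number;
  sorter_uni: GPUBuffer;          // GeneralInfo, 5 x u32
  sorter_dis: GPUBuffer;          // IndirectDispatch, 3 x u32; usage includes INDIRECT
  key_a: GPUBuffer;               // padded_size * 4 bytes
  key_b: GPUBuffer;
  payload_a: GPUBuffer;           // ends up holding the final sorted indices
  payload_b: GPUBuffer;
  internal_mem: GPUBuffer;        // histograms + partition / lookback scratch
  sorter_bg: GPUBindGroup;        // six-buffer sort layout
  sorter_bg_pre: GPUBindGroup;    // preprocess layout, bound at @group(2)
  sorter_render_bg: GPUBindGroup; // render layout, bound at @group(1)
}
```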
## Padding

Keys are padded to a multiple of `keys_per_workgroup = HISTOGRAM_WG_SIZE * RS_HISTOGRAM_BLOCK_ROWS = 256 * 15 = 3840`. `createKeyvalBuffers` rounds the requested point count up to this multiple when allocating buffers, and `GeneralInfo.padded_size` stores the padded value. Preprocessing is responsible for zero-filling the extra slots when fewer splats are visible.
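A minimal sketch of that rounding rule, with the constants named as in the shader templates (the helper name `paddedKeyCount` is illustrative, not the module's API):

```ts
// Constants mirroring the values injected into the WGSL templates.
const HISTOGRAM_WG_SIZE = 256;
const RS_HISTOGRAM_BLOCK_ROWS = 15;
const KEYS_PER_WORKGROUP = HISTOGRAM_WG_SIZE * RS_HISTOGRAM_BLOCK_ROWS; // 3840

// Round a point count up to the next multiple of 3840, as
// createKeyvalBuffers does when sizing the key/payload buffers.
function paddedKeyCount(numPoints: number): number {
  return Math.ceil(numPoints / KEYS_PER_WORKGROUP) * KEYS_PER_WORKGROUP;
}

// e.g. paddedKeyCount(10_000) === 11_520 (three workgroups of 3840 keys)
```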
## Compute passes

1. `zero_histograms`
   - Workgroup size: 256.
   - Clears histograms and partition metadata in `internal_mem`.
   - When indirect mode is used, the dispatch count is read from `sorter_dis.dispatch_x` (which preprocessing set while emitting splats).
2. `calculate_histogram`
   - Workgroup size: 256, with 15 rows per thread, so 3840 keys per workgroup.
   - Each thread loads multiple depth keys, extracts the digit for all four radix passes, and accumulates counts in shared memory before atomically adding to the global histograms.
   - Constants such as `histogram_sg_size` and `rs_histogram_block_rows` are injected by `processShaderTemplate` so WGSL and TypeScript derive the same memory sizes.
3. `prefix_histogram`
   - Workgroup size: 128.
   - Implements an exclusive scan over each 256-entry histogram in shared memory (up-sweep + down-sweep).
   - Results are written back to the histogram portion of `internal_mem` so the scatter pass can look up global offsets per digit and per pass.
4. `scatter_even` and `scatter_odd`
   - Workgroup size: 256.
   - `scatter_even` handles passes 0 and 2 (reading from `key_a`, writing to `key_b`); `scatter_odd` handles passes 1 and 3 (reading from `key_b`, writing to `key_a`). The same logic applies to the payload buffers.
   - Each thread computes a local rank for its digit, adds the global offset from the histogram prefix, and writes the key/payload to the computed destination index.
   - At the end of the fourth pass, `key_a` and `payload_a` contain the final sorted order. A CPU reference of this four-pass structure is sketched below.
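To make the histogram -> prefix -> scatter structure concrete, here is a minimal CPU reference of the same four-pass, 8-bit LSD radix sort. It is a sketch for validating GPU output, not the module's code: it ignores workgroup partitioning and lookback, and it relies on the fact that non-negative IEEE-754 floats sort in the same order as their bit patterns, which is why the depth keys can be reinterpreted as `u32`.

```ts
// CPU reference for a four-pass, 8-bit LSD radix sort over u32 keys with a
// parallel payload array. Mirrors histogram -> prefix -> scatter per pass.
function radixSortRef(keys: Uint32Array, payload: Uint32Array): void {
  let srcK = keys, srcP = payload;
  let dstK = new Uint32Array(keys.length);
  let dstP = new Uint32Array(payload.length);
  for (let pass = 0; pass < 4; pass++) {
    const shift = pass * 8;
    // calculate_histogram: count each 8-bit digit
    const hist = new Uint32Array(256);
    for (let i = 0; i < srcK.length; i++) hist[(srcK[i] >>> shift) & 0xff]++;
    // prefix_histogram: exclusive scan turns counts into base offsets
    let sum = 0;
    for (let d = 0; d < 256; d++) { const c = hist[d]; hist[d] = sum; sum += c; }
    // scatter: stable write to base offset + running rank for the digit
    for (let i = 0; i < srcK.length; i++) {
      const d = (srcK[i] >>> shift) & 0xff;
      const dst = hist[d]++;
      dstK[dst] = srcK[i];
      dstP[dst] = srcP[i];
    }
    // ping-pong, as scatter_even / scatter_odd do on the GPU
    [srcK, dstK] = [dstK, srcK];
    [srcP, dstP] = [dstP, srcP];
  }
  // After an even number of passes the sorted data is back in keys/payload,
  // matching the GPU invariant that key_a / payload_a hold the final order.
}
```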
## Indirect vs fixed dispatch

- `recordSort` (fixed) calls zero -> histogram -> prefix -> scatter with explicit workgroup counts derived from `numPoints`.
- `recordSortIndirect` builds each compute pass using `dispatchWorkgroupsIndirect` and the counts stored in `sorter_dis`. This is the path the renderer uses, because preprocessing computes the actual number of visible splats every frame. A sketch of this path follows the list.
- `recordResetIndirectBuffer` resets both `dispatch_x` and `keys_size` to zero so preprocessing can increment them atomically.
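The indirect path might be recorded roughly as follows. This is a hedged sketch: the pipeline fields, bind-group index, and the fixed workgroup count for `prefix_histogram` are assumptions, while `dispatchWorkgroupsIndirect` reading `(x, y, z)` from `sorter_dis` comes straight from the buffer table above.

```ts
// Sketch of recordSortIndirect; field names and the prefix dispatch count
// are assumptions, not the module's exact code.
interface SortPipelines {
  zero: GPUComputePipeline;
  histogram: GPUComputePipeline;
  prefix: GPUComputePipeline;
  scatterEven: GPUComputePipeline;
  scatterOdd: GPUComputePipeline;
}

function recordSortIndirect(
  encoder: GPUCommandEncoder,
  p: SortPipelines,
  sortBindGroup: GPUBindGroup, // the six-buffer sort layout
  sorter_dis: GPUBuffer        // IndirectDispatch filled by preprocessing
): void {
  const pass = encoder.beginComputePass({ label: "radix sort" });
  pass.setBindGroup(0, sortBindGroup);
  pass.setPipeline(p.zero);
  pass.dispatchWorkgroupsIndirect(sorter_dis, 0);
  pass.setPipeline(p.histogram);
  pass.dispatchWorkgroupsIndirect(sorter_dis, 0);
  pass.setPipeline(p.prefix);
  pass.dispatchWorkgroups(4); // assumed: one 128-thread workgroup per radix pass
  // four digit passes, alternating even/odd scatter (ping-pong)
  for (const scatter of [p.scatterEven, p.scatterOdd, p.scatterEven, p.scatterOdd]) {
    pass.setPipeline(scatter);
    pass.dispatchWorkgroupsIndirect(sorter_dis, 0);
  }
  pass.end();
}
```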
## Integration with preprocessing

Preprocessing binds `sorter_bg_pre` and performs three critical tasks:

- Writes depth keys (packed floats) into `key_a` and payload indices into `payload_a` at the same offsets where it writes the 2D splats.
- Uses atomics on `sorter_uni.keys_size` to track the number of visible splats, and on `sorter_dis.dispatch_x` to track how many histogram workgroups are required (one increment per `256 * 15` splats).
- Pads any unused slots up to `padded_size` with zeros so the scatter pass stays within bounds.
Once preprocessing finishes, the renderer calls `recordSortIndirect` with the same `PointCloudSortStuff`. No CPU read-back is needed, because both the key count and the dispatch counts are already waiting inside `sorter_uni` and `sorter_dis`.
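For this to work, the counters must be cleared before preprocessing runs each frame. A minimal sketch of the `recordResetIndirectBuffer` step, assuming `dispatch_x` and `keys_size` are the first `u32` of `sorter_dis` and `sorter_uni` respectively (which matches the struct layouts in the buffer table):

```ts
// Zero dispatch_x and keys_size so preprocessing's atomic increments start
// from a clean slate each frame.
const ZERO_U32 = new Uint32Array([0]);

function recordResetIndirectBuffer(
  queue: GPUQueue,
  sorter_dis: GPUBuffer, // IndirectDispatch { dispatch_x, dispatch_y, dispatch_z }
  sorter_uni: GPUBuffer  // GeneralInfo { keys_size, ... }
): void {
  queue.writeBuffer(sorter_dis, 0, ZERO_U32); // dispatch_x = 0
  queue.writeBuffer(sorter_uni, 0, ZERO_U32); // keys_size = 0
}
```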
## Integration with the renderer

- `PointCloudSortStuff.sorter_render_bg` is bound at `@group(1)` in the render pipeline, providing `sorter_uni` (for instance counts) and `payload_a` (sorted indices).
- After sorting, the renderer copies `sorter_uni.keys_size` into the draw indirect buffer so the splat pipeline knows how many instances to draw (see the sketch after this list).
- In the multi-model path, a single `recordSortIndirect` and a single draw call cover every model, because preprocessing wrote each model into a unique range of the global splat and payload buffers.
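The `keys_size` copy is a single `copyBufferToBuffer`. The sketch below assumes `keys_size` sits at byte offset 0 of `sorter_uni` (per the `GeneralInfo` packing above) and relies on WebGPU's indirect draw argument layout `[vertexCount, instanceCount, firstVertex, firstInstance]`, which puts the instance count at byte offset 4.

```ts
// Copy the visible-splat count into the instance-count slot of the
// indirect draw arguments; both buffers need the matching COPY usages.
function copyInstanceCount(
  encoder: GPUCommandEncoder,
  sorter_uni: GPUBuffer,   // source: GeneralInfo { keys_size, ... }
  drawIndirect: GPUBuffer  // destination: 4 x u32 draw args
): void {
  encoder.copyBufferToBuffer(sorter_uni, 0, drawIndirect, 4, 4);
}
```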
## Initialization and validation

`GPURSSorter.create(device, queue)` is async because it:

- Iterates through candidate subgroup sizes (16, 32, 16, 8, 1) and builds pipelines for each trial (see the sketch after this list).
- Calls `testSort` (which sorts 8192 floats) to ensure the chosen configuration works on the current adapter.
- Returns a sorter instance only when a configuration succeeds; otherwise it throws.
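The trial loop might look like the sketch below. The constructor and `testSort` signatures are assumptions; only the candidate order, the 8192-float test, and the throw-on-failure behavior come from the description above.

```ts
// Hypothetical shape of GPURSSorter.create's trial-and-validate loop.
async function createSorter(device: GPUDevice, queue: GPUQueue): Promise<GPURSSorter> {
  for (const subgroupSize of [16, 32, 16, 8, 1]) { // candidate order from the text
    const candidate = new GPURSSorter(device, subgroupSize); // builds all pipelines
    if (await candidate.testSort(device, queue)) return candidate; // sorts 8192 floats
  }
  throw new Error("GPURSSorter: no subgroup size passed testSort on this adapter");
}
```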
## Debugging tips

- Verify `sorter_uni.keys_size` after preprocessing to confirm the atomic counters match the expected visible-splat count (a read-back helper is sketched below).
- Ensure `dispatch_x` equals `ceil(keys_size / (256 * 15))` before calling `recordSortIndirect`.
- Remember that after four passes the sorted data resides in `payload_a`. If the renderer sees stale results, the usual causes are a missing ping-pong flip or preprocessing writing into the wrong buffer.
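For the first two checks, a small staging-buffer read-back is enough. This helper is a debugging sketch, not part of the module; it assumes `sorter_uni` was created with `COPY_SRC` usage, and it stalls on a GPU-to-CPU round trip, so keep it out of production frames.

```ts
// Read keys_size (first u32 of sorter_uni) back to the CPU for inspection.
async function readKeysSize(device: GPUDevice, sorter_uni: GPUBuffer): Promise<number> {
  const staging = device.createBuffer({
    size: 4,
    usage: GPUBufferUsage.COPY_DST | GPUBufferUsage.MAP_READ,
  });
  const encoder = device.createCommandEncoder();
  encoder.copyBufferToBuffer(sorter_uni, 0, staging, 0, 4);
  device.queue.submit([encoder.finish()]);
  await staging.mapAsync(GPUMapMode.READ);
  const keysSize = new Uint32Array(staging.getMappedRange())[0]; // read before unmap
  staging.unmap();
  staging.destroy();
  return keysSize;
}
```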