Preprocessing Module Architecture

This document explains how GaussianPreprocessor stitches WebGPU resources together, packs uniforms, and drives preprocess.wgsl so that multiple point clouds can be projected into a single global 2D splat buffer every frame. It focuses on the Phase B (multi-model) path that GaussianRenderer.prepareMulti executes today.

Pipeline Overview

┌──────────────┐  ┌──────────────────────────────┐  ┌────────────────┐
│ PointCloud   │─▶│  GaussianPreprocessor        │─▶│ GPURSSorter    │
│ GPU buffers  │  │  dispatchModel(...)          │  │ (radix sort)   │
└─────┬────────┘  │ ├ camera + settings uniforms │  └────────┬───────┘
      │           │ ├ project, cull, evaluate SH │           │
      │           │ └ write global splat slice   │           │
      │           └──────────────┬───────────────┘     Renderer
      │                          │
      │            Global splat buffer + sort counters
      │           (indirect draw)

Characteristics:

One compute dispatch per point cloud, 256 threads/workgroup, one Gaussian per thread.
All models share the same splat2D buffer and sort resources; each dispatch writes into [baseOffset, baseOffset + count).
Two pipelines exist at runtime: SH (view-dependent) and raw RGB (for DynamicPointCloud). The shader path is chosen via the USE_RAW_COLOR pipeline constant.
Sort counters (keys_size, indirect dispatch_x) are updated atomically inside the shader so the radix sorter can run via indirect dispatch without CPU intervention.

Bind Groups & Pipeline Layout

initialize() builds a pipeline layout with four bind groups, which is exactly WebGPU's default maxBindGroups limit of 4, while still supporting multi-model output:

@group(0) Camera uniforms (272 B)
  view, view⁻¹, proj (Y-flipped), proj⁻¹, viewport, focal lengths

@group(1) Point cloud data
  binding 0: Gaussian buffer (read-only storage)
  binding 1: SH / raw-color buffer (read-only storage)
  binding 2: splat2D output (storage) — bound to the global buffer
  binding 3: point-cloud draw uniforms (uniform buffer)

@group(2) Sorting buffers (from GPURSSorter.createPreprocessBindGroupLayout)
  binding 0: sort infos & atomic counters (storage)
  binding 1: depth keys (storage)
  binding 2: payload indices (storage)
  binding 3: indirect dispatch buffer (storage)

@group(3) Settings + model params
  binding 0: render settings uniform (80 B)
  binding 1: model params uniform (128 B, owned by PointCloud)

The compute pipeline is compiled from preprocess.wgsl, with the SH degree injected into the WGSL source and USE_RAW_COLOR supplied through pipeline constants.
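
As a rough illustration of how the two pipeline variants could be parameterized, the sketch below builds the compute-stage portion of a pipeline descriptor as a plain object. The helper names preprocessComputeStage and injectShDegree, the MAX_SH_DEG constant, and the "prepend a constant" injection scheme are all assumptions made for this example; the real implementation may inject the SH degree differently.

```typescript
// Hypothetical helpers: they only build plain values and never touch a GPUDevice.
interface PreprocessComputeStage {
  entryPoint: string;
  constants: Record<string, number>;
}

// Pipeline-overridable constants are supplied as numbers. For a bool override,
// 0 maps to false and any other value maps to true.
function preprocessComputeStage(useRawColor: boolean): PreprocessComputeStage {
  return {
    entryPoint: "preprocess",
    constants: { USE_RAW_COLOR: useRawColor ? 1 : 0 },
  };
}

// Assumed injection scheme: prepend a module-scope constant to the WGSL text.
// MAX_SH_DEG is an illustrative name, not necessarily the one the shader uses.
function injectShDegree(wgslSource: string, shDegree: number): string {
  if (!Number.isInteger(shDegree) || shDegree < 0 || shDegree > 3) {
    throw new RangeError(`SH degree must be an integer in 0..3, got ${shDegree}`);
  }
  return `const MAX_SH_DEG: u32 = ${shDegree}u;\n${wgslSource}`;
}
```

The returned stage object can be spread into the compute field of the descriptor passed to device.createComputePipeline, next to the shader module created from the injected source.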

Dispatch Flow (`dispatchModel`)

For each point cloud in the frame, the renderer calls GaussianPreprocessor.dispatchModel({...}, encoder). The method performs the following steps before recording the compute pass:

Pack camera data — packCameraUniforms fills a 272‑byte scratch buffer with view/projection matrices (including a WebGPU Y‑flip), their inverses, the viewport, and focal lengths derived from PerspectiveCamera.projection.focal(). The buffer is flushed via UniformBuffer.setData + flush.
Pack render settings — packSettingsUniforms writes clipping boxes, gaussianScaling, maxSHDegree, env/mip flags, kernelSize, walltime, sceneExtend, and scene center into the 80‑byte render-settings uniform, then flushes it.
Update per-model params — pointCloud.updateModelParamsWithOffset(modelMatrix, baseOffset) stores the transform and slice offset in the point cloud's own 128‑byte uniform. If the point cloud exposes setPrecisionForShader (dynamic ONNX path), it is invoked so quantization metadata (data types, scales, zero-points) lands in bytes 96-119 before the buffer is flushed.
Handle dynamic counts — When countBuffer is provided, the preprocessor flushes pointCloud.modelParamsUniforms first, then copies four bytes from countBuffer into the model-params buffer at byte offset 68 (the num_points field). This lets ONNX generators drive indirect draws without a CPU round-trip.
Bind resources — Groups 0-3 are assembled using the freshly updated buffers. Group 1's binding 2 points at the global splat2D buffer rather than the point-cloud-local buffer.
Dispatch — Workgroups = ceil(pointCloud.numPoints / 256). The compute pass writes splats into the base-offset region, updates sort keys/payloads, and increments sorter counters atomically.

After every model has been dispatched, the renderer triggers a single GPURSSorter.recordSortIndirect(...) followed by one indirect draw. The indirect draw's instanceCount is populated by copying sorter_uni.keys_size into the draw buffer.
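
The dynamic-count and dispatch steps above reduce to a little arithmetic and one buffer copy. The sketch below is illustrative only: workgroupCount, recordCountCopy, and CopyEncoderSketch are hypothetical names, and the interface is a structural subset of GPUCommandEncoder.copyBufferToBuffer so that the example runs without a GPU device.

```typescript
const PREPROCESS_WORKGROUP_SIZE = 256;
const NUM_POINTS_BYTE_OFFSET = 68; // num_points field inside the model-params uniform
const U32_BYTES = 4;

// One thread per Gaussian, so the dispatch size is a ceiling division.
function workgroupCount(numPoints: number): number {
  return Math.ceil(numPoints / PREPROCESS_WORKGROUP_SIZE);
}

// Structural subset of GPUCommandEncoder, generic over the buffer type.
interface CopyEncoderSketch<B> {
  copyBufferToBuffer(
    source: B, sourceOffset: number,
    destination: B, destinationOffset: number,
    size: number,
  ): void;
}

// Records the GPU-side copy that overwrites num_points with the generated count.
// The model-params uniform must already be flushed, or the CPU write would
// clobber the copied value.
function recordCountCopy<B>(
  encoder: CopyEncoderSketch<B>,
  countBuffer: B,
  modelParamsBuffer: B,
): void {
  encoder.copyBufferToBuffer(
    countBuffer, 0,
    modelParamsBuffer, NUM_POINTS_BYTE_OFFSET,
    U32_BYTES,
  );
}
```

A real GPUCommandEncoder satisfies CopyEncoderSketch<GPUBuffer>, so the same call shape applies when recording into the frame's encoder.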

Uniform Layouts

Camera Uniform (272 bytes)

0-63    : view matrix (mat4x4<f32>)
64-127  : view inverse
128-191 : projection matrix (with Y flip)
192-255 : projection inverse
256-263 : viewport (width, height)
264-271 : focal lengths (fx, fy)
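
To make the byte offsets concrete, here is a minimal packing sketch for this layout. CameraSketch and packCameraSketch are hypothetical names, not the actual packCameraUniforms signature, and the code assumes the matrices are column-major with the Y flip already applied to the projection.

```typescript
const CAMERA_UNIFORM_BYTES = 272;

// Hypothetical input shape; each matrix holds 16 column-major floats.
interface CameraSketch {
  view: ArrayLike<number>;
  viewInv: ArrayLike<number>;
  proj: ArrayLike<number>;    // assumed to be Y-flipped already
  projInv: ArrayLike<number>;
  viewport: [number, number]; // width, height
  focal: [number, number];    // fx, fy
}

function copyMat4(out: Float32Array, floatOffset: number, m: ArrayLike<number>, label: string): void {
  if (m.length !== 16) throw new RangeError(`${label} must have 16 elements`);
  out.set(m, floatOffset);
}

// Offsets are in floats: byte offset = float offset * 4.
function packCameraSketch(
  cam: CameraSketch,
  out: Float32Array = new Float32Array(CAMERA_UNIFORM_BYTES / 4),
): Float32Array {
  copyMat4(out, 0, cam.view, "view");        // bytes 0-63
  copyMat4(out, 16, cam.viewInv, "viewInv"); // bytes 64-127
  copyMat4(out, 32, cam.proj, "proj");       // bytes 128-191
  copyMat4(out, 48, cam.projInv, "projInv"); // bytes 192-255
  out.set(cam.viewport, 64);                 // bytes 256-263
  out.set(cam.focal, 66);                    // bytes 264-271
  return out;
}
```

Passing a preallocated Float32Array as the second argument reuses the same scratch storage every frame.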

Render Settings Uniform (80 bytes)

0-15    : clipping box min (vec4)
16-31   : clipping box max (vec4)
32      : gaussianScaling (f32)
36      : maxSHDegree (u32)
40      : showEnvMap flag (u32)
44      : mipSplatting flag (u32)
48      : kernelSize (f32)
52      : walltime (f32)
56      : sceneExtend (f32)
60      : padding
64-79   : scene center (vec3 + padding)
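
Because this uniform mixes f32 and u32 fields, a DataView is a convenient way to write it. The following is a sketch under assumptions: RenderSettingsSketch and packSettingsSketch are illustrative names, the field names mirror the table above rather than the real packSettingsUniforms input, and the host is assumed to be little-endian.

```typescript
type SettingsVec3 = [number, number, number];

// Hypothetical input shape mirroring the layout table.
interface RenderSettingsSketch {
  clipMin: SettingsVec3;
  clipMax: SettingsVec3;
  gaussianScaling: number;
  maxSHDegree: number;
  showEnvMap: boolean;
  mipSplatting: boolean;
  kernelSize: number;
  walltime: number;
  sceneExtend: number;
  center: SettingsVec3;
}

const SETTINGS_UNIFORM_BYTES = 80;

function packSettingsSketch(s: RenderSettingsSketch): ArrayBuffer {
  const buffer = new ArrayBuffer(SETTINGS_UNIFORM_BYTES); // zero-filled, so padding stays 0
  const view = new DataView(buffer);
  const LE = true; // assumption: little-endian host
  const putVec3 = (byteOffset: number, v: SettingsVec3): void => {
    v.forEach((value, i) => view.setFloat32(byteOffset + 4 * i, value, LE));
  };
  putVec3(0, s.clipMin);   // bytes 0-15, fourth lane is padding
  putVec3(16, s.clipMax);  // bytes 16-31, fourth lane is padding
  view.setFloat32(32, s.gaussianScaling, LE);
  view.setUint32(36, s.maxSHDegree, LE);
  view.setUint32(40, s.showEnvMap ? 1 : 0, LE);
  view.setUint32(44, s.mipSplatting ? 1 : 0, LE);
  view.setFloat32(48, s.kernelSize, LE);
  view.setFloat32(52, s.walltime, LE);
  view.setFloat32(56, s.sceneExtend, LE);
  putVec3(64, s.center);   // bytes 64-75, bytes 76-79 are padding
  return buffer;
}
```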

Model Params Uniform (128 bytes)

Managed inside PointCloud. It stores the model matrix, baseOffset (u32@64), num_points (u32@68), gaussian scaling, max SH degree, kernel/opacity/cutoff scales, render mode, and precision metadata (data types + scales/zero-points). Preprocessing relies on this buffer already being updated before dispatch.

Shader Internals (`preprocess.wgsl`)

Workgroup Topology

@compute @workgroup_size(256, 1, 1)
fn preprocess(@builtin(global_invocation_id) gid: vec3<u32>) {
  let idx = gid.x;
  if (idx >= arrayLength(&gaussians)) {
    return;
  }
  // ... process one Gaussian
}

256 threads per workgroup balances occupancy, register pressure, and memory coalescing. It also matches the 256 factor in the sorter's keys_per_workgroup = 256 * 15 constant (3840 keys per sort workgroup), which is used for indirect dispatch sizing.

Buffer Layouts

struct Gaussian {
  pos_opacity: array<u32, 2>, // packed f16 xyz + opacity
  cov        : array<u32, 3>, // packed f16 covariance (6 values)
}

struct Splat {
  v0: u32, v1: u32,          // major/minor axes as packed f16
  pos: u32,                  // screen position (f16)
  color0: u32, color1: u32,  // packed RGBA (f16)
}

The SH/RAW buffer is an array<array<u32, 24>>, giving 48 f16 values per Gaussian (16 SH coefficients × 3 channels) which the shader unpacks on demand.
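
For CPU-side debugging it can help to decode these packed words outside the shader. The sketch below mirrors the semantics of WGSL's unpack2x16float, where the low 16 bits give the first component. The function names are hypothetical, and no assumption is made about how coefficients and channels are ordered within the 48 values.

```typescript
// Decodes one IEEE 754 binary16 value given as a 16-bit integer.
function f16ToF32(half: number): number {
  const sign = (half & 0x8000) !== 0 ? -1 : 1;
  const exponent = (half >> 10) & 0x1f;
  const fraction = half & 0x3ff;
  if (exponent === 0) return sign * fraction * 2 ** -24; // zero or subnormal
  if (exponent === 0x1f) return fraction !== 0 ? NaN : sign * Infinity;
  return sign * (1 + fraction / 1024) * 2 ** (exponent - 15);
}

// Low half first, high half second, like unpack2x16float in WGSL.
function unpack2x16floatSketch(word: number): [number, number] {
  return [f16ToF32(word & 0xffff), f16ToF32(word >>> 16)];
}

// Expands one Gaussian's 24 packed words into 48 floats, in storage order.
function unpackShBlock(words: Uint32Array): Float32Array {
  if (words.length !== 24) throw new RangeError("expected 24 u32 words per Gaussian");
  const out = new Float32Array(48);
  words.forEach((word, i) => {
    out.set(unpack2x16floatSketch(word), 2 * i);
  });
  return out;
}
```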

Projection & Covariance

Unpack the symmetric 3×3 covariance matrix from three u32 values (six f16s) and scale it by the squared Gaussian scaling factor.
Compute the Jacobian of the perspective projection using the focal lengths and camera-space depth.
Extract the world-to-camera rotation (transpose of the upper-left 3×3 block of the view matrix).
Project via Σ₂D = (W·J)ᵀ · Σ₃D · (W·J) to produce a 2×2 covariance in screen space.

let Vrk = mat3x3<f32>( /* unpacked */ ) * scaling * scaling;
let J = mat3x3<f32>( /* focal, depth terms */ );
let W = transpose(mat3x3<f32>(camera.view[0].xyz, camera.view[1].xyz, camera.view[2].xyz));
let T = W * J;
let cov2d = transpose(T) * Vrk * T;

Eigenvalues/vectors are derived analytically, kernel_size is added to the diagonal for anti-aliasing, and the minor eigenvalue is clamped to ≥ 0.1 to keep ellipses well-behaved. Each eigenvector is scaled by sqrt(2λ) to yield the semi-axes that the renderer expects.
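
The closed-form eigendecomposition described above can be sketched on the CPU as follows. This is an illustration, not a port of the shader: splatAxesSketch is a hypothetical name, the 2×2 covariance is passed as its three unique entries [[a, b], [b, c]], and only the clamp and sqrt(2λ) scaling stated in the text are applied.

```typescript
interface SplatAxesSketch {
  major: [number, number];
  minor: [number, number];
}

function splatAxesSketch(a: number, b: number, c: number, kernelSize: number): SplatAxesSketch {
  // Low-pass filter: add kernelSize to the diagonal.
  const A = a + kernelSize;
  const C = c + kernelSize;

  // Eigenvalues of a symmetric 2x2 matrix: mid +/- radius.
  const mid = 0.5 * (A + C);
  const radius = Math.hypot(0.5 * (A - C), b);
  const lambdaMajor = mid + radius;
  const lambdaMinor = Math.max(mid - radius, 0.1); // clamp keeps ellipses well-behaved

  // Eigenvector for lambdaMajor solves (A - l) x + b y = 0, giving (b, l - A).
  let ex = b;
  let ey = lambdaMajor - A;
  const norm = Math.hypot(ex, ey);
  if (norm < 1e-12) {
    // Already axis-aligned with A >= C: the major axis is the x axis.
    ex = 1;
    ey = 0;
  } else {
    ex /= norm;
    ey /= norm;
  }

  // Scale unit eigenvectors by sqrt(2 * lambda); the minor axis is perpendicular.
  const sMajor = Math.sqrt(2 * lambdaMajor);
  const sMinor = Math.sqrt(2 * lambdaMinor);
  return {
    major: [ex * sMajor, ey * sMajor],
    minor: [-ey * sMinor, ex * sMinor],
  };
}
```

For example, a covariance of [[2, 1], [1, 2]] with kernelSize 0 has eigenvalues 3 and 1, so the major axis points along the diagonal with length sqrt(6).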

Color Evaluation

SH mode: evaluate_sh(dir, sh_deg) evaluates the precomputed basis polynomials up to degree 3. Degree thresholds let the shader skip higher-order terms when settings.maxSHDegree is lower, reducing ALU cost.
Raw RGB mode: when the pipeline is created with useRawColor=true, the USE_RAW_COLOR pipeline constant bypasses the SH math and treats the SH buffer as direct RGBA (still stored as packed f16 pairs).
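
To show what the degree threshold means in practice, here is a single-channel sketch truncated at degree 1. It follows the sign convention and constants of the reference 3D Gaussian Splatting implementation; the +0.5 offset, the clamp at zero, and the name evalShChannelSketch are assumptions about this shader rather than facts taken from it. Degrees 2 and 3 add further terms in the same pattern.

```typescript
const SH_C0 = 0.28209479177387814; // degree-0 basis constant
const SH_C1 = 0.4886025119029199;  // degree-1 basis constant

// Coefficients for one color channel: index 0 is degree 0, indices 1-3 are degree 1.
type ShCoeffsDeg1 = readonly [number, number, number, number];
type UnitDir = readonly [number, number, number];

// dir is assumed to be normalized.
function evalShChannelSketch(coeffs: ShCoeffsDeg1, dir: UnitDir, maxDegree: number): number {
  const [x, y, z] = dir;
  let value = SH_C0 * coeffs[0];
  if (maxDegree >= 1) {
    // Skipped entirely when the degree threshold is 0.
    value += SH_C1 * (-y * coeffs[1] + z * coeffs[2] - x * coeffs[3]);
  }
  // Assumed convention: shift by 0.5 and clamp negative colors to zero.
  return Math.max(value + 0.5, 0);
}
```

Calling the function with maxDegree 0 ignores the three degree-1 coefficients, which is the same saving the shader gets from its degree thresholds.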

Visibility & Culling

Clipping box: reject splats outside settings.clipping_box_min/max in world space.
Frustum test: after projection, require 0 < z < 1 and -1.2w < x,y < 1.2w. The 20% margin keeps splats whose centers lie just off-screen but whose footprints may still overlap the viewport.
Mip-splatting (optional): if enabled, compare determinants before/after adding kernel_size to modulate opacity and reduce flicker.

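All tests exit early via return, so culled Gaussians skip the covariance projection and color evaluation. A warp finishes sooner only when all of its lanes are culled, because lanes that return early otherwise idle until the remaining lanes are done.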
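
The clipping-box and frustum tests can be expressed as a small predicate. The sketch below is illustrative: the function names are hypothetical, and it assumes that z is compared after the perspective divide while x and y are compared against w in clip space, which is one reading of the conditions listed above.

```typescript
type CullVec3 = readonly [number, number, number];
type ClipPos = readonly [number, number, number, number]; // x, y, z, w in clip space

// World-space axis-aligned box test.
function insideClipBox(p: CullVec3, boxMin: CullVec3, boxMax: CullVec3): boolean {
  return (
    p[0] >= boxMin[0] && p[0] <= boxMax[0] &&
    p[1] >= boxMin[1] && p[1] <= boxMax[1] &&
    p[2] >= boxMin[2] && p[2] <= boxMax[2]
  );
}

// Frustum test with a safety margin on x and y.
function insideFrustumSketch(clip: ClipPos, margin: number = 1.2): boolean {
  const [x, y, z, w] = clip;
  if (w <= 0) return false; // behind the camera
  const zNdc = z / w;       // assumption: depth is tested after the divide
  const limit = margin * w;
  return zNdc > 0 && zNdc < 1 && Math.abs(x) < limit && Math.abs(y) < limit;
}

// Mirrors the early-return order: the cheap box test runs first.
function isVisibleSketch(world: CullVec3, clip: ClipPos, boxMin: CullVec3, boxMax: CullVec3): boolean {
  if (!insideClipBox(world, boxMin, boxMax)) return false;
  return insideFrustumSketch(clip);
}
```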

Atomic Handshake with the Sorter

let output_idx = atomicAdd(&sort_infos.keys_size, 1u);
points_2d[output_idx] = packed_splat;
sort_keys[output_idx] = bitcast<u32>(zfar - pos_ndc.z);
sort_payload[output_idx] = output_idx;

let KEYS_PER_WORKGROUP = 256u * 15u;
if (output_idx % KEYS_PER_WORKGROUP) == 0u {
  atomicAdd(&sort_dispatch.dispatch_x, 1u);
}

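These atomics keep sort_infos.keys_size equal to the number of visible splats and increment the indirect-dispatch counter whenever a splat lands on the first slot of a new block of KEYS_PER_WORKGROUP keys. Assuming both counters are reset to zero each frame, dispatch_x ends up as ceil(keys_size / KEYS_PER_WORKGROUP). The renderer later copies sort_infos.keys_size into the draw-indirect buffer to set the final instanceCount.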
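
A sequential simulation makes the counter relationship easy to check. This sketch replays the shader logic one visible splat at a time; simulateSortCounters is a hypothetical name, and both counters are assumed to start at zero.

```typescript
const SORT_KEYS_PER_WORKGROUP = 256 * 15; // 3840, as in the shader snippet

interface SortCountersSketch {
  keysSize: number;
  dispatchX: number;
}

// Replays the per-thread logic sequentially for `visibleCount` surviving splats.
function simulateSortCounters(visibleCount: number): SortCountersSketch {
  let keysSize = 0;
  let dispatchX = 0;
  for (let i = 0; i < visibleCount; i++) {
    const outputIdx = keysSize; // atomicAdd returns the value before the add
    keysSize += 1;
    if (outputIdx % SORT_KEYS_PER_WORKGROUP === 0) {
      dispatchX += 1;           // first key of a new block
    }
  }
  return { keysSize, dispatchX };
}
```

With 3840 visible splats the sorter needs one workgroup, and the 3841st splat bumps dispatch_x to two, which matches ceil(keys_size / 3840).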

Multi-model / Global Path

Phase B consolidates resource usage across models without changing the shader contract:

The renderer allocates one large splat2D buffer plus one PointCloudSortStuff structure whose capacity matches the sum of all point counts.
For every model: compute baseOffset, bind the shared buffers, and call dispatchModel (using the SH or RGB preprocessor depending on pointCloud.colorMode).
When all models are processed, call recordSortIndirect once and issue a single indirect draw using the global sorted payloads.

This approach eliminates redundant sort/draw passes, keeps bind-group layouts stable, and sets the stage for future batching or occlusion-aware scheduling.
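
The per-model slices can be derived with a running sum over point counts. The sketch below is a minimal illustration with a hypothetical helper name, assuming slices are packed back to back in dispatch order.

```typescript
interface SliceLayoutSketch {
  baseOffsets: number[]; // first splat index of each model's slice
  total: number;         // required capacity of the global buffers
}

// Model i writes into [baseOffsets[i], baseOffsets[i] + counts[i]).
function assignBaseOffsets(counts: readonly number[]): SliceLayoutSketch {
  const baseOffsets: number[] = [];
  let total = 0;
  for (const count of counts) {
    if (!Number.isInteger(count) || count < 0) {
      throw new RangeError(`point count must be a non-negative integer, got ${count}`);
    }
    baseOffsets.push(total);
    total += count;
  }
  return { baseOffsets, total };
}
```

For three models with 100, 50, and 25 points, the slices start at 0, 100, and 150, and the shared buffers need room for 175 splats.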

Performance & Diagnostics

Scratch buffers — Camera and settings uniforms reuse preallocated ArrayBuffers to avoid per-frame allocations.
Dual pipelines — The renderer creates two GaussianPreprocessor instances (SH + RGB) so the shader can specialize on color mode at compile time.
Precision metadata — Dynamic point clouds call setPrecisionForShader before dispatch so the shader knows how to decode INT8/FP16 storage.
Atomic hot spots — Extremely dense scenes can become atomic-bound; reducing gaussianScaling, tightening clipping boxes, or enabling mip-splatting helps lower the visible count per frame.
Debug tooling — debugCountValues() taps into debugCountPipeline (see src/utils/debug-gpu-buffers.ts) to compare ONNX count buffers against the model-parameter uniform, making indirect-count issues easier to diagnose.

References

src/preprocess/gaussian_preprocessor.ts — authoritative TypeScript implementation.
src/shaders/preprocess.wgsl — the compute shader discussed above.
src/renderer/gaussian_renderer.ts — shows how the renderer instantiates preprocessors and calls dispatchModel during prepareMulti.
src/sort/radix_sort.ts — details the sorter resources (sorter_bg_pre, sorter_uni, sorter_dis) that preprocessing consumes.