Preprocessing Module Architecture
This document explains how GaussianPreprocessor stitches WebGPU resources together, packs uniforms, and drives preprocess.wgsl so that multiple point clouds can be projected into a single global 2D splat buffer every frame. It focuses on the Phase B (multi-model) path that GaussianRenderer.prepareMulti executes today.
Pipeline Overview
┌──────────────┐   ┌──────────────────────────────┐   ┌────────────────┐
│ PointCloud   │──▶│ GaussianPreprocessor         │──▶│ GPURSSorter    │
│ GPU buffers  │   │ dispatchModel(...)           │   │ (radix sort)   │
└─────┬────────┘   │ ├ camera + settings uniforms │   └────────┬───────┘
      │            │ ├ project, cull, evaluate SH │            │
      │            │ └ write global splat slice   │            │
      │            └──────────────┬───────────────┘        Renderer
      │                           │
      │          Global splat buffer + sort counters
      │                                                 (indirect draw)
Characteristics:
- One compute dispatch per point cloud, 256 threads/workgroup, one Gaussian per thread.
- All models share the same splat2D buffer and sort resources; each dispatch writes into [baseOffset, baseOffset + count).
- Two pipelines exist at runtime: SH (view-dependent) and raw RGB (for DynamicPointCloud). The shader path is chosen via the USE_RAW_COLOR pipeline constant.
- Sort counters (keys_size, indirect dispatch_x) are updated atomically inside the shader so the radix sorter can run via indirect dispatch without CPU intervention.
Bind Groups & Pipeline Layout
initialize() builds a pipeline layout with four bind groups, which stays within WebGPU's default maxBindGroups limit of 4 while still supporting multi-model output:
@group(0) Camera uniforms (272 B)
    view, view⁻¹, proj (Y-flipped), proj⁻¹, viewport, focal lengths
@group(1) Point cloud data
    binding 0: Gaussian buffer (read-only storage)
    binding 1: SH / raw-color buffer (read-only storage)
    binding 2: splat2D output (storage), bound to the global buffer
    binding 3: point-cloud draw uniforms (uniform buffer)
@group(2) Sorting buffers (from GPURSSorter.createPreprocessBindGroupLayout)
    binding 0: sort infos & atomic counters (storage)
    binding 1: depth keys (storage)
    binding 2: payload indices (storage)
    binding 3: indirect dispatch buffer (storage)
@group(3) Settings + model params
    binding 0: render settings uniform (80 B)
    binding 1: model params uniform (128 B, owned by PointCloud)
The compute pipeline is compiled from preprocess.wgsl, with the SH degree injected into the WGSL source and USE_RAW_COLOR supplied through pipeline constants.
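As a rough illustration of the layout above, the entries for @group(1) can be written as plain data in the shape WebGPU expects for bind-group-layout entries. The helper name pointCloudGroupEntries and the local types are hypothetical, and the numeric stage flag is hard-coded so the sketch also runs outside a browser; the real initialize() may build its layouts differently.

```typescript
// GPUShaderStage.COMPUTE has the numeric value 0x4 in the WebGPU spec.
const COMPUTE_STAGE = 0x4;

type SketchBufferType = "uniform" | "storage" | "read-only-storage";

interface SketchLayoutEntry {
  binding: number;
  visibility: number;
  buffer: { type: SketchBufferType };
}

// Hypothetical helper: one entry per binding of @group(1), in binding order.
function pointCloudGroupEntries(): SketchLayoutEntry[] {
  const types: SketchBufferType[] = [
    "read-only-storage", // binding 0: Gaussian buffer
    "read-only-storage", // binding 1: SH / raw-color buffer
    "storage",           // binding 2: splat2D output (global buffer)
    "uniform",           // binding 3: point-cloud draw uniforms
  ];
  return types.map((type, binding) => ({
    binding,
    visibility: COMPUTE_STAGE,
    buffer: { type },
  }));
}
```

In a WebGPU context, an array like this would typically be passed as the entries field of device.createBindGroupLayout.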
Dispatch Flow (dispatchModel)
For each point cloud in the frame, the renderer calls GaussianPreprocessor.dispatchModel({...}, encoder). The method performs the following steps before recording the compute pass:
- Pack camera data: packCameraUniforms fills a 272-byte scratch buffer with the view and projection matrices (including a WebGPU Y-flip), their inverses, the viewport, and focal lengths derived from PerspectiveCamera.projection.focal(). The buffer is uploaded via UniformBuffer.setData followed by flush.
- Pack render settings: packSettingsUniforms writes clipping boxes, gaussianScaling, maxSHDegree, env/mip flags, kernelSize, walltime, sceneExtend, and scene center into the 80-byte render-settings uniform, then flushes it.
- Update per-model params: pointCloud.updateModelParamsWithOffset(modelMatrix, baseOffset) stores the transform and slice offset in the point cloud's own 128-byte uniform. If the point cloud exposes setPrecisionForShader (dynamic ONNX path), it is invoked so quantization metadata (data types, scales, zero-points) lands in bytes 96-119 before the buffer is flushed.
- Handle dynamic counts: When countBuffer is provided, the preprocessor flushes pointCloud.modelParamsUniforms first, then copies four bytes from countBuffer into the model-params buffer at byte offset 68 (the num_points field). This lets ONNX generators drive indirect draws without a CPU round-trip.
- Bind resources: Groups 0-3 are assembled using the freshly updated buffers. Group 1's binding 2 points at the global splat2D buffer rather than the point-cloud-local buffer.
- Dispatch: Workgroups = ceil(pointCloud.numPoints / 256). The compute pass writes splats into the base-offset region, updates sort keys/payloads, and increments sorter counters atomically.
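The arithmetic behind the last two steps can be sketched as follows. The helper names workgroupCount and countCopyArgs are hypothetical, and the source offset of 0 in the count buffer is an assumption; only the 256-thread workgroup size and the byte offset 68 come from the description above.

```typescript
const WORKGROUP_SIZE = 256;        // matches @workgroup_size(256, 1, 1)
const NUM_POINTS_BYTE_OFFSET = 68; // num_points field in the model-params uniform

// One thread per Gaussian, so round up to whole workgroups.
function workgroupCount(numPoints: number): number {
  return Math.ceil(numPoints / WORKGROUP_SIZE);
}

// Hypothetical helper: arguments for copying a GPU-side count (one u32)
// into the num_points field. sourceOffset = 0 is an assumption.
function countCopyArgs(): { sourceOffset: number; destinationOffset: number; size: number } {
  return { sourceOffset: 0, destinationOffset: NUM_POINTS_BYTE_OFFSET, size: 4 };
}
```

With a command encoder at hand, these values would map onto encoder.copyBufferToBuffer(countBuffer, 0, modelParamsBuffer, 68, 4); both offsets and the size are multiples of 4, as that call requires.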
After every model has been dispatched, the renderer triggers a single GPURSSorter.recordSortIndirect(...) followed by one indirect draw. The indirect draw's instanceCount is populated by copying sorter_uni.keys_size into the draw buffer.
Uniform Layouts
Camera Uniform (272 bytes)
0-63 : view matrix (mat4x4<f32>)
64-127 : view inverse
128-191 : projection matrix (with Y flip)
192-255 : projection inverse
256-263 : viewport (width, height)
264-271 : focal lengths (fx, fy)
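A minimal sketch of packing this layout on the CPU, assuming every matrix arrives as 16 column-major floats and the projection is already Y-flipped. The CameraInputs shape and the name packCameraSketch are hypothetical and do not describe the actual packCameraUniforms signature.

```typescript
const CAMERA_UNIFORM_BYTES = 272;

// Hypothetical input shape; each matrix is 16 floats in column-major order.
interface CameraInputs {
  view: Float32Array;
  viewInv: Float32Array;
  proj: Float32Array; // assumed to be Y-flipped already
  projInv: Float32Array;
  viewport: [number, number];
  focal: [number, number];
}

function packCameraSketch(cam: CameraInputs): Float32Array {
  const out = new Float32Array(CAMERA_UNIFORM_BYTES / 4); // 68 floats
  out.set(cam.view, 0);      // bytes 0-63
  out.set(cam.viewInv, 16);  // bytes 64-127
  out.set(cam.proj, 32);     // bytes 128-191
  out.set(cam.projInv, 48);  // bytes 192-255
  out.set(cam.viewport, 64); // bytes 256-263
  out.set(cam.focal, 66);    // bytes 264-271
  return out;
}
```

Float indices are simply the byte offsets divided by four, which makes it easy to check the sketch against the table.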
Render Settings Uniform (80 bytes)
0-15 : clipping box min (vec4)
16-31 : clipping box max (vec4)
32 : gaussianScaling (f32)
36 : maxSHDegree (u32)
40 : showEnvMap flag (u32)
44 : mipSplatting flag (u32)
48 : kernelSize (f32)
52 : walltime (f32)
56 : sceneExtend (f32)
60 : padding
64-79 : scene center (vec3 + padding)
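Because this uniform mixes f32 and u32 fields, a DataView is a convenient way to illustrate the layout. The sketch below assumes the input shape shown (RenderSettingsInput is hypothetical) and leaves the unused w components and the padding at zero; the real packSettingsUniforms may fill them differently.

```typescript
type Vec3 = [number, number, number];

// Hypothetical input shape; field names mirror the layout table above.
interface RenderSettingsInput {
  clipMin: Vec3;
  clipMax: Vec3;
  gaussianScaling: number;
  maxSHDegree: number;
  showEnvMap: boolean;
  mipSplatting: boolean;
  kernelSize: number;
  walltime: number;
  sceneExtend: number;
  sceneCenter: Vec3;
}

function packSettingsSketch(s: RenderSettingsInput): ArrayBuffer {
  const buf = new ArrayBuffer(80);
  const dv = new DataView(buf);
  const LE = true; // little-endian, as typed arrays are on WebGPU-capable platforms
  s.clipMin.forEach((v, i) => dv.setFloat32(i * 4, v, LE));          // 0-15 (w left at 0)
  s.clipMax.forEach((v, i) => dv.setFloat32(16 + i * 4, v, LE));     // 16-31 (w left at 0)
  dv.setFloat32(32, s.gaussianScaling, LE);
  dv.setUint32(36, s.maxSHDegree, LE);
  dv.setUint32(40, s.showEnvMap ? 1 : 0, LE);
  dv.setUint32(44, s.mipSplatting ? 1 : 0, LE);
  dv.setFloat32(48, s.kernelSize, LE);
  dv.setFloat32(52, s.walltime, LE);
  dv.setFloat32(56, s.sceneExtend, LE);
  // byte 60 is padding and stays 0
  s.sceneCenter.forEach((v, i) => dv.setFloat32(64 + i * 4, v, LE)); // 64-79
  return buf;
}
```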
Model Params Uniform (128 bytes)
Managed inside PointCloud. It stores the model matrix, baseOffset (u32@64), num_points (u32@68), gaussian scaling, max SH degree, kernel/opacity/cutoff scales, render mode, and precision metadata (data types + scales/zero-points). Preprocessing relies on this buffer already being updated before dispatch.
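The two offsets called out above can be illustrated with a small CPU-side writer. This sketch assumes the model matrix occupies bytes 0-63 (implied by baseOffset sitting at byte 64, but not stated explicitly) and leaves every other field untouched; writeModelHeader is a hypothetical name, not the PointCloud API.

```typescript
const MODEL_PARAMS_BYTES = 128;

// Hypothetical helper: writes only the fields whose offsets are documented.
function writeModelHeader(
  buf: ArrayBuffer,
  modelMatrix: Float32Array, // 16 floats, assumed to live at bytes 0-63
  baseOffset: number,
  numPoints: number,
): void {
  if (buf.byteLength !== MODEL_PARAMS_BYTES) {
    throw new Error("model-params buffer must be 128 bytes");
  }
  new Float32Array(buf, 0, 16).set(modelMatrix);
  const dv = new DataView(buf);
  dv.setUint32(64, baseOffset, true); // baseOffset (u32 @ 64)
  dv.setUint32(68, numPoints, true);  // num_points (u32 @ 68)
}
```

When a count buffer is in use, the value written at byte 68 here is what the GPU-side copy later overwrites.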
Shader Internals (preprocess.wgsl)
Workgroup Topology
@compute @workgroup_size(256, 1, 1)
fn preprocess(@builtin(global_invocation_id) gid: vec3<u32>) {
    let idx = gid.x;
    if (idx >= arrayLength(&gaussians)) {
        return;
    }
    // ... process one Gaussian
}
256 threads per workgroup strikes a balance between warp occupancy, register pressure, and memory coalescing while matching the sorter's keys_per_workgroup = 256 * 15 constant used for indirect dispatch sizing.
Buffer Layouts
struct Gaussian {
    pos_opacity: array<u32, 2>, // packed f16 xyz + opacity
    cov: array<u32, 3>,         // packed f16 covariance (6 values)
}
struct Splat {
    v0: u32, v1: u32,           // major/minor axes as packed f16
    pos: u32,                   // screen position (f16)
    color0: u32, color1: u32,   // packed RGBA (f16)
}
The SH/RAW buffer is an array<array<u32, 24>>, giving 48 f16 values per Gaussian (16 SH coefficients × 3 channels) which the shader unpacks on demand.
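For debugging it can help to decode such a record on the CPU. The sketch below mirrors what WGSL's unpack2x16float does (low 16 bits first, then high 16 bits) and expands one 24-word record into 48 floats. The function names are hypothetical, and the order of coefficients and channels within the record is not specified here, so the output is left as a flat array.

```typescript
// Decode one IEEE 754 binary16 value stored in the low 16 bits of h.
function halfToFloat(h: number): number {
  const sign = (h & 0x8000) !== 0 ? -1 : 1;
  const exp = (h >> 10) & 0x1f;
  const frac = h & 0x3ff;
  if (exp === 0) return sign * frac * 2 ** -24; // zero and subnormals
  if (exp === 31) return frac !== 0 ? NaN : sign * Infinity;
  return sign * (1 + frac / 1024) * 2 ** (exp - 15);
}

// Same component order as WGSL unpack2x16float: x from bits 0-15, y from bits 16-31.
function unpack2x16(word: number): [number, number] {
  return [halfToFloat(word & 0xffff), halfToFloat(word >>> 16)];
}

// Expand one 24-word SH/raw-color record into 48 floats.
function decodeShRecord(words: Uint32Array): Float32Array {
  if (words.length !== 24) throw new Error("expected 24 u32 words per Gaussian");
  const out = new Float32Array(48);
  words.forEach((w, i) => {
    const [lo, hi] = unpack2x16(w);
    out[2 * i] = lo;
    out[2 * i + 1] = hi;
  });
  return out;
}
```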
Projection & Covariance
- Unpack the symmetric 3×3 covariance matrix from three u32 values (six f16s) and scale it by the squared Gaussian scaling factor.
- Compute the Jacobian of the perspective projection using the focal lengths and camera-space depth.
- Extract the world-to-camera rotation (transpose of the upper-left 3×3 block of the view matrix).
- Project via Σ₂D = (W·J)ᵀ · Σ₃D · (W·J) to produce a 2×2 covariance in screen space.
let Vrk = mat3x3<f32>( /* unpacked */ ) * scaling * scaling;
let J = mat3x3<f32>( /* focal, depth terms */ );
let W = transpose(mat3x3<f32>(camera.view[0].xyz, camera.view[1].xyz, camera.view[2].xyz));
let T = W * J;
let cov2d = transpose(T) * Vrk * T;
Eigenvalues/vectors are derived analytically, kernel_size is added to the diagonal for anti-aliasing, and the minor eigenvalue is clamped to ≥ 0.1 to keep ellipses well-behaved. Each eigenvector is scaled by sqrt(2λ) to yield the semi-axes that the renderer expects.
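The eigen step can be checked on the CPU with a closed-form decomposition of the symmetric 2×2 matrix [[a, b], [b, c]]. This is a sketch under the rules stated above (kernel size added to the diagonal, minor eigenvalue clamped to 0.1, axes scaled by sqrt(2λ)); the function name splatAxes and the handling of the b = 0 case are assumptions, not a transcription of the shader.

```typescript
type Vec2 = [number, number];

function splatAxes(
  a: number, b: number, c: number, kernelSize: number,
): { major: Vec2; minor: Vec2 } {
  const a2 = a + kernelSize; // anti-aliasing term on the diagonal
  const c2 = c + kernelSize;
  const mid = 0.5 * (a2 + c2);
  const det = a2 * c2 - b * b;
  const radius = Math.sqrt(Math.max(mid * mid - det, 0));
  const lambda1 = mid + radius;
  const lambda2 = Math.max(mid - radius, 0.1); // clamp the minor eigenvalue

  let v1: Vec2;
  if (b !== 0) {
    // (A - lambda1 * I) v = 0 gives v proportional to (b, lambda1 - a2).
    const len = Math.hypot(b, lambda1 - a2);
    v1 = [b / len, (lambda1 - a2) / len];
  } else {
    v1 = a2 >= c2 ? [1, 0] : [0, 1]; // already axis-aligned
  }
  const v2: Vec2 = [-v1[1], v1[0]]; // perpendicular to v1

  const s1 = Math.sqrt(2 * lambda1);
  const s2 = Math.sqrt(2 * lambda2);
  return {
    major: [v1[0] * s1, v1[1] * s1],
    minor: [v2[0] * s2, v2[1] * s2],
  };
}
```

For example, a covariance of [[2, 1], [1, 2]] with no kernel term has eigenvalues 3 and 1, so the major axis points along the diagonal with length sqrt(6).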
Color Evaluation
- SH mode: evaluate_sh(dir, sh_deg) evaluates precomputed basis polynomials up to degree 3. Degree thresholds let the shader skip higher-order terms when settings.maxSHDegree is lower, reducing ALU cost.
- Raw RGB mode: When the pipeline is created with useRawColor=true, a specialization constant bypasses SH math and treats the SH buffer as direct RGBA (still stored as packed f16 pairs).
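To illustrate how a degree threshold skips higher-order terms, here is a CPU sketch limited to bands 0 and 1. It assumes the sign conventions, constants, and the final +0.5 offset of the reference 3D Gaussian Splatting implementation; whether evaluate_sh in preprocess.wgsl matches these exactly is an assumption, and evaluateShSketch is a hypothetical name.

```typescript
type Vec3 = [number, number, number];

// Standard real spherical-harmonic constants for bands 0 and 1.
const SH_C0 = 0.28209479177387814;
const SH_C1 = 0.4886025119029199;

// coeffs[k] holds the RGB triple for SH coefficient k (at least 4 entries
// when maxDegree >= 1). Bands 2 and 3 are omitted from this sketch.
function evaluateShSketch(dir: Vec3, coeffs: Vec3[], maxDegree: number): Vec3 {
  const [x, y, z] = dir;
  const out: Vec3 = [0, 0, 0];
  for (let c = 0; c < 3; c++) {
    let v = SH_C0 * coeffs[0][c];
    if (maxDegree >= 1) {
      // Skipped entirely when the degree threshold is 0.
      v += SH_C1 * (-y * coeffs[1][c] + z * coeffs[2][c] - x * coeffs[3][c]);
    }
    out[c] = v + 0.5; // offset used by the reference implementation (assumed here)
  }
  return out;
}
```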
Visibility & Culling
- Clipping box: reject splats outside settings.clipping_box_min/max in world space.
- Frustum test: after projection, ensure 0 < z < 1 and -1.2w < x,y < 1.2w to keep a small safety margin.
- Mip-splatting (optional): if enabled, compare determinants before/after adding kernel_size to modulate opacity and reduce flicker.
Every test exits early via return, so culled Gaussians skip the remaining covariance, color, and atomic work.
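The first two tests can be mirrored on the CPU for validation. In this sketch the depth comparison is applied after the perspective divide (z / w) while x and y are compared against 1.2 · w in clip space; that split is an assumption about how the shader interprets the bounds, and both function names are hypothetical.

```typescript
type Vec3 = [number, number, number];
type Vec4 = [number, number, number, number];

// World-space clipping box test (inclusive bounds).
function insideClipBox(p: Vec3, boxMin: Vec3, boxMax: Vec3): boolean {
  return p.every((v, i) => v >= boxMin[i] && v <= boxMax[i]);
}

// Frustum test on a clip-space position with a 20% safety margin in x and y.
function insideFrustum(clip: Vec4): boolean {
  const [x, y, z, w] = clip;
  const bounds = 1.2 * w;
  const depth = z / w; // assumed: depth is compared after the perspective divide
  return depth > 0 && depth < 1 && Math.abs(x) < bounds && Math.abs(y) < bounds;
}
```

Points behind the camera (w ≤ 0) fail automatically, because the margin 1.2 · w is then non-positive.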
Atomic Handshake with the Sorter
let output_idx = atomicAdd(&sort_infos.keys_size, 1u);
points_2d[output_idx] = packed_splat;
sort_keys[output_idx] = bitcast<u32>(zfar - pos_ndc.z);
sort_payload[output_idx] = output_idx;
let KEYS_PER_WORKGROUP = 256u * 15u;
if (output_idx % KEYS_PER_WORKGROUP) == 0u {
    atomicAdd(&sort_dispatch.dispatch_x, 1u);
}
These atomics keep sort_infos.keys_size in sync with the number of visible splats and increment the indirect-dispatch counter every time a new block of keys is filled. The renderer later copies sort_infos.keys_size into the draw-indirect buffer to set the final instanceCount.
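A sequential CPU simulation makes the counter relationship easy to verify: after N visible splats, dispatch_x equals ceil(N / 3840). The function name simulateCounters is hypothetical, and the simulation ignores ordering between threads, which does not affect the final totals.

```typescript
const KEYS_PER_WORKGROUP = 256 * 15; // 3840 keys per sorter workgroup

function simulateCounters(visibleSplats: number): { keysSize: number; dispatchX: number } {
  let keysSize = 0;
  let dispatchX = 0;
  for (let i = 0; i < visibleSplats; i++) {
    const outputIdx = keysSize; // atomicAdd returns the value before the add
    keysSize += 1;
    if (outputIdx % KEYS_PER_WORKGROUP === 0) {
      dispatchX += 1; // first key of a new block requests one more workgroup
    }
  }
  return { keysSize, dispatchX };
}
```

Because index 0 triggers the first increment, a single visible splat already yields one sorter workgroup, and an empty frame yields none.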
Multi-model / Global Path
Phase B consolidates resource usage across models without changing the shader contract:
- The renderer allocates one large splat2D buffer plus one PointCloudSortStuff structure whose capacity matches the sum of all point counts.
- For every model: compute baseOffset, bind the shared buffers, and call dispatchModel (using the SH or RGB preprocessor depending on pointCloud.colorMode).
- When all models are processed, call recordSortIndirect once and issue a single indirect draw using the global sorted payloads.
This approach eliminates redundant sort/draw passes, keeps bind-group layouts stable, and sets the stage for future batching or occlusion-aware scheduling.
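The base offsets are a running prefix sum over the per-model point counts, and the final sum is the capacity the global buffers need. A minimal sketch, with planGlobalSlices and ModelSlice as hypothetical names:

```typescript
interface ModelSlice {
  baseOffset: number; // first splat index owned by this model
  count: number;      // number of points in this model
}

function planGlobalSlices(pointCounts: number[]): { slices: ModelSlice[]; capacity: number } {
  const slices: ModelSlice[] = [];
  let running = 0;
  for (const count of pointCounts) {
    slices.push({ baseOffset: running, count });
    running += count; // next model starts where this one ends
  }
  return { slices, capacity: running };
}
```

Each slice covers the half-open range [baseOffset, baseOffset + count), so consecutive slices never overlap.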
Performance & Diagnostics
- Scratch buffers: Camera and settings uniforms reuse preallocated ArrayBuffers to avoid per-frame allocations.
- Dual pipelines: The renderer creates two GaussianPreprocessor instances (SH + RGB) so the shader can specialize on color mode at compile time.
- Precision metadata: Dynamic point clouds call setPrecisionForShader before dispatch so the shader knows how to decode INT8/FP16 storage.
- Atomic hot spots: Extremely dense scenes can become atomic-bound; reducing gaussianScaling, tightening clipping boxes, or enabling mip-splatting helps lower the visible count per frame.
- Debug tooling: debugCountValues() taps into debugCountPipeline (see src/utils/debug-gpu-buffers.ts) to compare ONNX count buffers against the model-parameter uniform, making indirect-count issues easier to diagnose.
References
- src/preprocess/gaussian_preprocessor.ts: authoritative TypeScript implementation.
- src/shaders/preprocess.wgsl: the compute shader discussed above.
- src/renderer/gaussian_renderer.ts: shows how the renderer instantiates preprocessors and calls dispatchModel during prepareMulti.
- src/sort/radix_sort.ts: details the sorter resources (sorter_bg_pre, sorter_uni, sorter_dis) that preprocessing consumes.