Representing 3D Faces with Learnable B-Spline Volumes

Google
CUBE Teaser Figure

CUBE’s control features influence the face shape only locally, which allows precise shape editing by swapping control features or editing them interactively. We demonstrate the usefulness of CUBE with two applications: feed-forward facial scan registration, where CUBE achieves state-of-the-art results, and image-based regression.

Abstract

We present CUBE (Control-based Unified B-spline Encoding), a new geometric representation for human faces that combines B-spline volumes with learned features, and demonstrate its use as a decoder for 3D scan registration and monocular 3D face reconstruction. Unlike existing B-spline representations with 3D control points, CUBE is parametrized by a lattice (e.g., 8 × 8 × 8) of high-dimensional control features, increasing the model's expressivity. These features define a continuous, two-stage mapping from a 3D parametric domain to 3D Euclidean space via an intermediate feature space. First, high-dimensional control features are locally blended using the B-spline bases, yielding a high-dimensional feature vector whose first three values define a 3D base mesh. A small MLP then processes this feature vector to predict a residual displacement from the base shape, yielding the final refined 3D coordinates. To reconstruct 3D surfaces in dense semantic correspondence, CUBE is queried at 3D coordinates sampled from a fixed template mesh. Crucially, CUBE retains the local support property of traditional B-spline representations, enabling local surface editing by updating individual control features. We demonstrate the strengths of this representation by training transformer-based encoders to predict CUBE's control features from unstructured point clouds and monocular images, achieving state-of-the-art scan registration results compared to recent baselines.

The CUBE representation

Defined by a lattice of high-dimensional control features, CUBE reconstructs a 3D face in a two-stage process. First, 3D coordinates are sampled from a fixed template mesh. These coordinates are then mapped to high-dimensional features via B-spline interpolation using the lattice of control features. The first three values of the resulting high-dimensional feature vector form a coarse base shape. Finally, the full feature vector is input to a small MLP, which predicts coordinate offsets (residuals) from the base shape, resulting in the refined 3D point coordinates.
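
The two-stage mapping described above can be sketched in a few lines of NumPy. This is only an illustration under stated assumptions, not the authors' implementation: the uniform cubic basis, the feature width (16), the hidden width (32), the single ReLU layer, and all function names are hypothetical choices, and random points stand in for samples from the template mesh.

```python
import numpy as np

def cubic_bspline_weights(u, n):
    """Uniform cubic B-spline weights for u in [0, 1] over n controls (assumed basis)."""
    t = np.clip(u, 0.0, 1.0) * (n - 3)            # n controls give n - 3 cubic spans
    i = np.minimum(np.floor(t).astype(int), n - 4)  # index of the first of 4 active controls
    f = t - i                                       # local coordinate inside the span
    w = np.stack([(1 - f) ** 3,
                  3 * f**3 - 6 * f**2 + 4,
                  -3 * f**3 + 3 * f**2 + 3 * f + 1,
                  f**3], axis=-1) / 6.0
    return i, w                                     # shapes (P,), (P, 4)

def blend_features(points, lattice):
    """Stage 1: blend control features. points (P, 3) in [0,1]^3, lattice (n, n, n, D)."""
    n = lattice.shape[0]
    (ix, wx), (iy, wy), (iz, wz) = (cubic_bspline_weights(points[:, k], n) for k in range(3))
    offs = np.arange(4)
    # Gather the 4x4x4 neighbourhood of control features for every point: (P, 4, 4, 4, D)
    nb = lattice[(ix[:, None] + offs)[:, :, None, None],
                 (iy[:, None] + offs)[:, None, :, None],
                 (iz[:, None] + offs)[:, None, None, :]]
    return np.einsum("pa,pb,pc,pabcd->pd", wx, wy, wz, nb)

def mlp_residual(feat, params):
    """Stage 2: small MLP (one hidden ReLU layer here) predicting a 3D offset."""
    W1, b1, W2, b2 = params
    return np.maximum(feat @ W1 + b1, 0.0) @ W2 + b2

def cube_decode(points, lattice, params):
    feat = blend_features(points, lattice)
    base = feat[:, :3]                              # first three values form the base shape
    return base + mlp_residual(feat, params), base

rng = np.random.default_rng(0)
n, D, H = 8, 16, 32                                 # hypothetical lattice and layer sizes
lattice = rng.normal(size=(n, n, n, D))
params = (rng.normal(size=(D, H)) * 0.1, np.zeros(H),
          rng.normal(size=(H, 3)) * 0.01, np.zeros(3))
template_points = rng.random((1000, 3))             # stand-in for template mesh samples
vertices, base = cube_decode(template_points, lattice, params)
```

Because the template points are fixed, every decoded face shares the same vertex ordering, which is what gives the dense correspondence mentioned above.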


Feed-forward scan registration

We register scans to a common mesh topology by directly predicting the control features of CUBE. For this, input scan vertices are tokenized, concatenated with trainable control tokens, and then passed through multiple transformer layers. The resulting control token embeddings are extracted and reshaped to form the feature lattice of CUBE. Querying CUBE at the points of a fixed template mesh then reconstructs the registered 3D mesh.
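
The token flow in this paragraph can be illustrated with a toy NumPy model. Everything specific here is an assumption made for the sketch: the linear per-vertex tokenizer, the single-head attention layers without normalization or feed-forward blocks, the token ordering (control tokens first), and all sizes and names. It is meant to show the shapes involved, not the actual network.

```python
import numpy as np

rng = np.random.default_rng(0)
N_CTRL, D_MODEL, D_FEAT = 8, 32, 16   # hypothetical lattice size, token width, feature width

def self_attention(x, Wq, Wk, Wv):
    """Single-head self-attention with a residual connection (toy transformer layer)."""
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)   # numerically stable softmax
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)
    return x + attn @ v

def predict_lattice(scan_xyz, params):
    """scan_xyz: (P, 3) unstructured scan vertices -> (n, n, n, D_FEAT) control features."""
    scan_tokens = scan_xyz @ params["W_in"] + params["b_in"]       # tokenize scan vertices
    tokens = np.concatenate([params["ctrl_tokens"], scan_tokens])  # prepend control tokens
    for Wq, Wk, Wv in params["layers"]:
        tokens = self_attention(tokens, Wq, Wk, Wv)
    ctrl = tokens[: N_CTRL**3] @ params["W_out"]                   # keep control tokens only
    return ctrl.reshape(N_CTRL, N_CTRL, N_CTRL, D_FEAT)            # reshape into the lattice

params = {
    "W_in": rng.normal(size=(3, D_MODEL)) * 0.1,
    "b_in": np.zeros(D_MODEL),
    "ctrl_tokens": rng.normal(size=(N_CTRL**3, D_MODEL)) * 0.1,    # trainable in a real model
    "layers": [tuple(rng.normal(size=(D_MODEL, D_MODEL)) * 0.1 for _ in range(3))
               for _ in range(2)],
    "W_out": rng.normal(size=(D_MODEL, D_FEAT)) * 0.1,
}
scan = rng.normal(size=(256, 3))       # stand-in for scan vertices
lattice = predict_lattice(scan, params)
```

The number of scan vertices can vary between scans, while the number of control tokens, and therefore the lattice shape, stays fixed.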

CUBE Architecture Diagram

Our model can be used to register static and dynamic scans.


Additional scan registration results

Columns, repeated for each example: Scan, Prediction, Point-to-scan distance, Overlay

Local Surface Editing

Crucially, CUBE retains the local support property of traditional B-spline representations. This enables local surface editing by directly updating individual control features, allowing for precise localized deformation control.
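
The local support property can be checked numerically. The sketch below assumes a uniform cubic B-spline basis on an 8 × 8 × 8 lattice with 16-dimensional features (the basis degree, feature width, and function names are assumptions for illustration). It edits one control feature and verifies that only query points inside that control's support change; since the MLP is applied per point to the blended feature, points whose features are unchanged also keep their final positions.

```python
import numpy as np

def basis_matrix(u, n):
    """Dense (P, n) matrix of uniform cubic B-spline weights, 4 nonzeros per row."""
    t = np.clip(u, 0.0, 1.0) * (n - 3)
    i = np.minimum(np.floor(t).astype(int), n - 4)
    f = t - i
    w = np.stack([(1 - f) ** 3,
                  3 * f**3 - 6 * f**2 + 4,
                  -3 * f**3 + 3 * f**2 + 3 * f + 1,
                  f**3], axis=-1) / 6.0
    B = np.zeros((len(u), n))
    rows = np.arange(len(u))[:, None]
    B[rows, i[:, None] + np.arange(4)] = w          # scatter the 4 active weights
    return B

def blend(points, lattice):
    """Tensor-product blend of control features: (P, 3), (n, n, n, D) -> (P, D)."""
    n = lattice.shape[0]
    Bx, By, Bz = (basis_matrix(points[:, k], n) for k in range(3))
    return np.einsum("pa,pb,pc,abcd->pd", Bx, By, Bz, lattice, optimize=True)

rng = np.random.default_rng(0)
points = rng.random((2000, 3))                      # stand-in for template query points
lattice = rng.normal(size=(8, 8, 8, 16))

edited = lattice.copy()
edited[1, 1, 1] += 1.0                              # edit a single control feature

delta = np.abs(blend(points, edited) - blend(points, lattice)).max(axis=1)
changed = delta > 1e-12
# With 8 controls per axis there are 5 spans; control index 1 is active only
# in the first two, so affected points must have every coordinate below 2/5.
inside_support = np.all(points < 0.4, axis=1)
```

In this setup `changed` is a subset of `inside_support`: points outside the edited control's support are left exactly as they were.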


Registration-free retargeting

Transferring CUBE feature residuals from a source scan sequence to target subjects allows us to retarget a facial performance directly to various target identities without having to register any of the scans.
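
One plausible reading of this residual transfer, written as a minimal sketch: the residual of each source frame is taken relative to a reference lattice of the source subject and added to a reference lattice of the target subject. How the reference is chosen (for example a neutral frame) and the function and variable names are assumptions, not details given in the text.

```python
import numpy as np

def retarget(source_seq, source_ref, target_ref):
    """Add per-frame source feature residuals to a target's reference features.

    source_seq: (T, n, n, n, D) predicted control features for T source frames
    source_ref: (n, n, n, D) reference (e.g. neutral) features of the source (assumed)
    target_ref: (n, n, n, D) reference features of the target subject (assumed)
    """
    residuals = source_seq - source_ref      # what changes over the performance
    return target_ref + residuals            # (T, n, n, n, D) retargeted sequence

rng = np.random.default_rng(0)
source_ref = rng.normal(size=(8, 8, 8, 16))
target_ref = rng.normal(size=(8, 8, 8, 16))
source_seq = source_ref + 0.1 * rng.normal(size=(5, 8, 8, 8, 16))
retargeted = retarget(source_seq, source_ref, target_ref)
```

Each retargeted lattice can then be decoded at the fixed template points like any other CUBE lattice, so no mesh registration step is involved.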


Image-based reconstruction

CUBE control features can be regressed from images by concatenating learnable control embeddings to image patch tokens, and processing them with a standard vision transformer. With the exception of the patchify layer before the transformer encoder, our architecture for image-based regression is identical to the model we introduced for scan registration above. For image-based regression, we used a ViT-Large backbone with 8 × 8 × 8 control tokens.
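
The only architectural difference from the scan model, the patchify layer, can be sketched as follows. The 224 × 224 input resolution, the 16 × 16 patch size, the linear patch projection, and all names are assumptions for illustration; the token width of 1024 matches the standard ViT-Large hidden size.

```python
import numpy as np

def patchify(image, patch):
    """Split an (H, W, C) image into flattened non-overlapping patches: (N, patch*patch*C)."""
    H, W, C = image.shape
    assert H % patch == 0 and W % patch == 0
    x = image.reshape(H // patch, patch, W // patch, patch, C)
    x = x.transpose(0, 2, 1, 3, 4)               # (rows, cols, patch, patch, C)
    return x.reshape(-1, patch * patch * C)

rng = np.random.default_rng(0)
PATCH, D_MODEL, N_CTRL = 16, 1024, 8             # assumed patch size; ViT-Large width; lattice size

image = rng.random((224, 224, 3))                # assumed input resolution
patches = patchify(image, PATCH)                 # (196, 768)

W_patch = rng.normal(size=(patches.shape[1], D_MODEL)) * 0.02   # linear patch embedding
patch_tokens = patches @ W_patch                                # (196, 1024)
ctrl_tokens = rng.normal(size=(N_CTRL**3, D_MODEL)) * 0.02      # learnable control embeddings

# Sequence fed to the transformer: 512 control tokens followed by 196 patch tokens.
tokens = np.concatenate([ctrl_tokens, patch_tokens])
```

After the transformer, the first 512 token embeddings would be projected and reshaped into the 8 × 8 × 8 feature lattice, as in the scan registration model.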

Image-based Reconstruction Architecture Diagram

Columns, repeated for each example: Input image, Prediction

BibTeX Citation

@inproceedings{chandran26a,
  title={Representing 3D Faces with Learnable B-Spline Volumes},
  author={Chandran, Prashanth and Wang, Daoye and Bolkart, Timo},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition},
  year={2026}
}