Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. Recent approaches leverage vision and language foundation models to solve dense tasks through multi-view renderings and text queries. While promising, these pipelines require expensive inference over multiple renderings, depend heavily on large language model (LLM) prompt engineering for captions, and fail to exploit the inherent 3D geometry of shapes.
We address this gap by introducing an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our pre-training approach builds on existing data engines that generate part-annotated 3D shapes by pairing multi-view SAM regions with VLM captioning. Using this data, we train a point cloud transformer encoder in two stages: (1) distillation of dense 2D features from visual encoders such as DINOv2 into 3D patches, and (2) alignment of these patch embeddings with part-level text embeddings through a multi-positive contrastive objective.
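As an illustration of the Stage 2 objective, the sketch below shows one plausible form of a multi-positive contrastive loss between patch and part-text embeddings; the tensor shapes, the construction of pos_mask, and the temperature value are assumptions made for the example rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(patch_emb, text_emb, pos_mask, temperature=0.07):
    """Sketch of a multi-positive contrastive objective (shapes and temperature assumed).

    patch_emb: (P, D) patch embeddings from the 3D encoder's projector.
    text_emb:  (T, D) part-level text embeddings (e.g. from a CLIP-style text encoder).
    pos_mask:  (P, T) boolean; True where text j describes the part patch i belongs to,
               so a patch may have several positives.
    """
    patch_emb = F.normalize(patch_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = patch_emb @ text_emb.t() / temperature            # (P, T) similarities
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Average the log-likelihood over all positive texts of each patch,
    # then average over patches that have at least one positive.
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    per_patch = -(log_prob * pos_mask).sum(dim=1) / pos_counts
    return per_patch[pos_mask.any(dim=1)].mean()
```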
Our 3D encoder achieves zero-shot 3D part segmentation with fast single-pass inference and no test-time multi-view rendering, while significantly outperforming previous rendering-based and feed-forward approaches across several 3D part segmentation benchmarks.
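For concreteness, here is a minimal sketch of what single-pass zero-shot inference could look like: language-aligned patch features are compared to text embeddings of the candidate part names by cosine similarity, and each point inherits the label of its patch. The encoder interfaces and the patch_to_point mapping are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_part_segmentation(encoder, text_encoder, points, part_names, patch_to_point):
    """Single-pass zero-shot part segmentation sketch (interfaces assumed).

    encoder:        maps an (N, 3) point cloud to (P, D) language-aligned patch features.
    text_encoder:   maps a list of K part names to (K, D) text embeddings.
    patch_to_point: (N,) long tensor giving the patch index of each point.
    """
    patch_feat = F.normalize(encoder(points), dim=-1)           # (P, D)
    text_feat = F.normalize(text_encoder(part_names), dim=-1)   # (K, D)

    sim = patch_feat @ text_feat.t()                            # (P, K) cosine similarities
    patch_labels = sim.argmax(dim=1)                            # best part name per patch
    return patch_labels[patch_to_point]                         # (N,) per-point labels
```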
PatchAlign3D pre-training approach. Given an input point cloud, we extract multi-view visual features using a 2D backbone and back-project them into 3D space. In Stage 1, the 3D transformer encoder operates on sampled point cloud patches and learns to align its output patch tokens with the back-projected visual features. In Stage 2, we initialize from Stage 1, freeze all earlier layers, and train only the last transformer block and projector to align patch-level features with textual embeddings in a contrastive manner.
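To make the two ingredients of this pipeline concrete, the sketch below shows (a) one way to back-project multi-view 2D features onto a point cloud with a depth-based visibility check, and (b) the Stage 2 freezing scheme. The camera conventions, the visibility tolerance, and attribute names such as encoder.blocks and projector are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def backproject_multiview_features(feat_maps, depth_maps, intrinsics, extrinsics,
                                    points, depth_tol=0.01):
    """Lift dense 2D features onto 3D points by averaging over visible views (sketch).

    feat_maps:  (V, C, Hf, Wf) dense features from the 2D backbone (e.g. DINOv2).
    depth_maps: (V, H, W) rendered depth per view, used for occlusion checks.
    intrinsics: (V, 3, 3); extrinsics: (V, 4, 4) world-to-camera transforms.
    points:     (N, 3) point cloud in world coordinates.
    """
    V, C, _, _ = feat_maps.shape
    N = points.shape[0]
    feats = torch.zeros(N, C)
    counts = torch.zeros(N, 1)
    homog = torch.cat([points, torch.ones(N, 1)], dim=1)         # (N, 4) homogeneous coords

    for v in range(V):
        cam = (extrinsics[v] @ homog.t()).t()[:, :3]             # camera-frame coordinates
        z = cam[:, 2].clamp(min=1e-6)
        uv = (intrinsics[v] @ cam.t()).t()[:, :2] / z.unsqueeze(1)  # pixel coordinates

        H, W = depth_maps[v].shape
        grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,          # normalize to [-1, 1]
                            uv[:, 1] / (H - 1) * 2 - 1], dim=1).view(1, 1, N, 2)

        depth = F.grid_sample(depth_maps[v].view(1, 1, H, W), grid,
                              align_corners=True).view(N)
        visible = (cam[:, 2] > 0) & ((z - depth).abs() < depth_tol * depth.clamp(min=1e-6))

        feat = F.grid_sample(feat_maps[v].unsqueeze(0), grid,
                             align_corners=True).view(C, N).t()  # (N, C) sampled features
        feats[visible] += feat[visible]
        counts[visible] += 1

    return feats / counts.clamp(min=1)                           # average over visible views

def freeze_for_stage2(encoder, projector):
    """Stage 2 sketch: train only the last transformer block and the projector."""
    for p in encoder.parameters():
        p.requires_grad_(False)
    for p in encoder.blocks[-1].parameters():                    # assumed attribute name
        p.requires_grad_(True)
    for p in projector.parameters():
        p.requires_grad_(True)
```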
Zero-shot part segmentation results on ShapeNetPart. PatchAlign3D significantly outperforms the strongest rendering-based and feed-forward approaches.
Qualitative comparisons on ShapeNetPart. We show ground truth (top row) and predictions from different baselines and PatchAlign3D across six representative shapes. The part legends below each column indicate the semantic labels used for zero-shot prediction. PatchAlign3D produces more precise and coherent segmentations, despite relying solely on an encoder and patch-level features.
@misc{hadgi2026patchalign3dlocalfeaturealignment,
title={PatchAlign3D: Local Feature Alignment for Dense 3D Shape Understanding},
author={Souhail Hadgi and Bingchen Gong and Ramana Sundararaman and Emery Pierson and Lei Li and Peter Wonka and Maks Ovsjanikov},
year={2026},
eprint={2601.02457},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.02457},
}