Current foundation models for 3D shapes excel at global tasks (retrieval, classification) but transfer poorly to local part-level reasoning. Recent approaches leverage vision and language foundation models to solve dense tasks through multi-view renderings and text queries. While promising, these pipelines require expensive inference over multiple renderings, depend heavily on large language model (LLM) prompt engineering for captions, and fail to exploit the inherent 3D geometry of shapes.
We address this gap by introducing an encoder-only 3D model that produces language-aligned patch-level features directly from point clouds. Our pre-training approach builds on existing data engines that generate part-annotated 3D shapes by pairing multi-view SAM regions with VLM captioning. Using this data, we train a point cloud transformer encoder in two stages: (1) distillation of dense 2D features from visual encoders such as DINOv2 into 3D patches, and (2) alignment of these patch embeddings with part-level text embeddings through a multi-positive contrastive objective.
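As an illustration of the Stage 2 objective, the sketch below shows one plausible form of a multi-positive contrastive loss between patch and part-text embeddings; the tensor shapes, the construction of pos_mask, and the temperature value are assumptions made for the example rather than the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def multi_positive_contrastive_loss(patch_emb, text_emb, pos_mask, temperature=0.07):
    """Sketch of a multi-positive contrastive objective (shapes and temperature assumed).

    patch_emb: (P, D) patch embeddings from the 3D encoder's projector.
    text_emb:  (T, D) part-level text embeddings (e.g. from a CLIP-style text encoder).
    pos_mask:  (P, T) boolean; True where text j describes the part patch i belongs to,
               so a patch may have several positives.
    """
    patch_emb = F.normalize(patch_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)

    logits = patch_emb @ text_emb.t() / temperature            # (P, T) similarities
    log_prob = logits - torch.logsumexp(logits, dim=1, keepdim=True)

    # Average the log-likelihood over all positive texts of each patch,
    # then average over patches that have at least one positive.
    pos_counts = pos_mask.sum(dim=1).clamp(min=1)
    per_patch = -(log_prob * pos_mask).sum(dim=1) / pos_counts
    return per_patch[pos_mask.any(dim=1)].mean()
```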
Our 3D encoder achieves zero-shot 3D part segmentation with fast single-pass inference and no test-time multi-view rendering, while significantly outperforming previous rendering-based and feed-forward approaches across several 3D part segmentation benchmarks.
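For concreteness, here is a minimal sketch of what single-pass zero-shot inference could look like: language-aligned patch features are compared to text embeddings of the candidate part names by cosine similarity, and each point inherits the label of its patch. The encoder interfaces and the patch_to_point mapping are hypothetical placeholders.

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def zero_shot_part_segmentation(encoder, text_encoder, points, part_names, patch_to_point):
    """Single-pass zero-shot part segmentation sketch (interfaces assumed).

    encoder:        maps an (N, 3) point cloud to (P, D) language-aligned patch features.
    text_encoder:   maps a list of K part names to (K, D) text embeddings.
    patch_to_point: (N,) long tensor giving the patch index of each point.
    """
    patch_feat = F.normalize(encoder(points), dim=-1)           # (P, D)
    text_feat = F.normalize(text_encoder(part_names), dim=-1)   # (K, D)

    sim = patch_feat @ text_feat.t()                            # (P, K) cosine similarities
    patch_labels = sim.argmax(dim=1)                            # best part name per patch
    return patch_labels[patch_to_point]                         # (N,) per-point labels
```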
PatchAlign3D pre-training approach. Given an input point cloud, we extract multi-view visual features using a 2D backbone and back-project them into 3D space. In Stage 1, the 3D transformer encoder operates on sampled point cloud patches and learns to align its output patch tokens with the back-projected visual features. In Stage 2, we initialize from Stage 1, freeze all earlier layers, and train only the last transformer block and projector to align patch-level features with textual embeddings in a contrastive manner.
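To make the two ingredients of this pipeline concrete, the sketch below shows (a) one way to back-project multi-view 2D features onto a point cloud with a depth-based visibility check, and (b) the Stage 2 freezing scheme. The camera conventions, the visibility tolerance, and attribute names such as encoder.blocks and projector are assumptions for illustration, not the released implementation.

```python
import torch
import torch.nn.functional as F

def backproject_multiview_features(feat_maps, depth_maps, intrinsics, extrinsics,
                                    points, depth_tol=0.01):
    """Lift dense 2D features onto 3D points by averaging over visible views (sketch).

    feat_maps:  (V, C, Hf, Wf) dense features from the 2D backbone (e.g. DINOv2).
    depth_maps: (V, H, W) rendered depth per view, used for occlusion checks.
    intrinsics: (V, 3, 3); extrinsics: (V, 4, 4) world-to-camera transforms.
    points:     (N, 3) point cloud in world coordinates.
    """
    V, C, _, _ = feat_maps.shape
    N = points.shape[0]
    feats = torch.zeros(N, C)
    counts = torch.zeros(N, 1)
    homog = torch.cat([points, torch.ones(N, 1)], dim=1)         # (N, 4) homogeneous coords

    for v in range(V):
        cam = (extrinsics[v] @ homog.t()).t()[:, :3]             # camera-frame coordinates
        z = cam[:, 2].clamp(min=1e-6)
        uv = (intrinsics[v] @ cam.t()).t()[:, :2] / z.unsqueeze(1)  # pixel coordinates

        H, W = depth_maps[v].shape
        grid = torch.stack([uv[:, 0] / (W - 1) * 2 - 1,          # normalize to [-1, 1]
                            uv[:, 1] / (H - 1) * 2 - 1], dim=1).view(1, 1, N, 2)

        depth = F.grid_sample(depth_maps[v].view(1, 1, H, W), grid,
                              align_corners=True).view(N)
        visible = (cam[:, 2] > 0) & ((z - depth).abs() < depth_tol * depth.clamp(min=1e-6))

        feat = F.grid_sample(feat_maps[v].unsqueeze(0), grid,
                             align_corners=True).view(C, N).t()  # (N, C) sampled features
        feats[visible] += feat[visible]
        counts[visible] += 1

    return feats / counts.clamp(min=1)                           # average over visible views

def freeze_for_stage2(encoder, projector):
    """Stage 2 sketch: train only the last transformer block and the projector."""
    for p in encoder.parameters():
        p.requires_grad_(False)
    for p in encoder.blocks[-1].parameters():                    # assumed attribute name
        p.requires_grad_(True)
    for p in projector.parameters():
        p.requires_grad_(True)
```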
Zero-shot part segmentation results on ShapeNetPart. PatchAlign3D significantly outperforms the strongest rendering-based and feed-forward approaches.
Qualitative comparisons on ShapeNetPart. We show ground truth (top row) and predictions from different baselines and PatchAlign3D across six representative shapes. The part legends below each column indicate the semantic labels used for zero-shot prediction. PatchAlign3D produces more precise and coherent segmentations, despite relying solely on an encoder and patch-level features.
@misc{hadgi2026patchalign3dlocalfeaturealignment,
title={PatchAlign3D: Local Feature Alignment for Dense 3D Shape Understanding},
author={Souhail Hadgi and Bingchen Gong and Ramana Sundararaman and Emery Pierson and Lei Li and Peter Wonka and Maks Ovsjanikov},
year={2026},
eprint={2601.02457},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2601.02457},
}