Abstract
We introduce CosCAD, a novel framework for CAD model retrieval and pose alignment from a single image. Unlike previous methods that rely solely on image data and are therefore sensitive to occlusion, CosCAD leverages cross-modal contrastive learning to integrate image, CAD model, and text features into a shared representation space, improving retrieval accuracy even when visual cues are ambiguous or objects are partially occluded. To enhance retrieval efficiency, we propose Tri-Indexed Quantized Graph Search, which accelerates CAD retrieval with an optimized indexing structure. For pose alignment, we combine image features with the geometric features of CAD models to predict object rotation and scale, using an attention-based method to capture spatial correlations within the scene; this improves multi-object location estimation and 9-DoF pose alignment. Experimental results demonstrate that CosCAD outperforms existing methods such as ROCA and SPARC in both CAD model retrieval and pose estimation, while offering a more than 6× retrieval speedup on large datasets, underscoring its potential for interactive environments and autonomous systems.
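The abstract does not specify the form of the contrastive objective, so the following is only a minimal sketch of how image, CAD, and text embeddings are commonly pulled into a shared space with a symmetric InfoNCE-style loss applied pairwise across the three modalities. All names (`embed_dim`, `temperature`, the toy batch) are illustrative assumptions, not CosCAD's actual implementation.

```python
# Sketch of a tri-modal contrastive objective (pairwise symmetric InfoNCE).
# Assumes three encoders already map image / CAD / text inputs to embeddings
# of the same dimension; random tensors stand in for their outputs below.
import torch
import torch.nn.functional as F


def info_nce(a: torch.Tensor, b: torch.Tensor, temperature: float = 0.07) -> torch.Tensor:
    """Symmetric InfoNCE between two batches of embeddings matched by index."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                     # (B, B) cosine similarities
    targets = torch.arange(a.size(0), device=a.device)   # positives on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def tri_modal_loss(img_z: torch.Tensor, cad_z: torch.Tensor, txt_z: torch.Tensor) -> torch.Tensor:
    """Average the pairwise losses so matching triplets align in one shared space."""
    return (info_nce(img_z, cad_z) +
            info_nce(img_z, txt_z) +
            info_nce(cad_z, txt_z)) / 3.0


# Toy usage with hypothetical batch size and embedding dimension.
batch, embed_dim = 8, 256
loss = tri_modal_loss(torch.randn(batch, embed_dim),
                      torch.randn(batch, embed_dim),
                      torch.randn(batch, embed_dim))
print(loss.item())
```

Under this formulation, a CAD model that is hard to match from an occluded image can still be retrieved through the text pathway, since all three modalities share one embedding space.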
| Original language | English |
|---|---|
| Title of host publication | Computational Visual Media: 13th International Conference, CVM 2025, Hong Kong SAR, China, April 19–21, 2025, Proceedings, Part I |
| Editors | Piotr Didyk, Junhui Hou |
| Publisher | Springer |
| Chapter | 19 |
| Pages | 367-387 |
| Number of pages | 21 |
| ISBN (Electronic) | 9789819658091 |
| ISBN (Print) | 9789819658084 |
| Publication status | Published - 2025 |
| Externally published | Yes |
Publication series
| Name | Lecture Notes in Computer Science |
|---|---|
| Publisher | Springer |
| Volume | 15663 |
| ISSN (Print) | 0302-9743 |
| ISSN (Electronic) | 1611-3349 |
Funding
This work was supported by the National Natural Science Foundation of China (Nos. T2322012 and 62172218) and the Guangdong Basic and Applied Basic Research Foundation (No. 2022A1515010170).