Abstract
Text-driven 3D editing is an emerging task that modifies 3D scenes according to text prompts. Current methods typically adapt pre-trained 2D image editors to multi-view observations, using bespoke strategies to fuse information across views. However, these approaches still struggle to ensure cross-view consistency because they lack precise control over how information is shared, yielding edits with insufficient visual change and blurry details. In this paper, we propose CoreEditor, a novel framework for consistent text-to-3D editing. At its core is a correspondence-constrained attention mechanism that enforces structured interactions between corresponding pixels, i.e., pixels expected to remain visually consistent throughout the diffusion denoising process. Unlike prior methods that rely solely on scene geometry, we strengthen these correspondences with semantic similarity derived from the diffusion denoising features. This joint support from geometry and semantics yields a robust multi-view editing process. In addition, we introduce a selective editing pipeline that lets users choose their preferred edit from multiple candidates, enabling a more flexible and user-centered 3D editing workflow. Extensive experiments demonstrate the effectiveness of CoreEditor: it produces high-quality 3D edits and significantly outperforms existing methods.
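The abstract's correspondence-constrained attention can be pictured as ordinary attention whose scores are masked so that each query pixel only attends to cross-view pixels supported by geometry or by semantic feature similarity. The sketch below is an illustrative reading of that idea, not the paper's implementation: the function name, the threshold `tau`, and the rule of combining the two cues with a maximum are all assumptions for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def corr_constrained_attention(q, k, v, geo_corr, sem_sim, tau=0.5):
    """Attention restricted to cross-view pixel pairs that correspond.

    q: (Nq, d) query features; k, v: (Nk, d) key/value features from other views.
    geo_corr / sem_sim: (Nq, Nk) correspondence scores in [0, 1], e.g. from
    geometric reprojection and from denoising-feature similarity.
    tau: threshold above which a pair counts as a correspondence (assumed).
    """
    # a pair is kept if either the geometric or the semantic cue supports it
    corr = np.maximum(geo_corr, sem_sim)
    mask = corr >= tau
    # queries with no correspondence fall back to unconstrained attention
    empty = ~mask.any(axis=1)
    mask[empty] = True
    scores = (q @ k.T) / np.sqrt(q.shape[1])
    scores = np.where(mask, scores, -np.inf)  # block non-corresponding pairs
    return softmax(scores, axis=1) @ v
```

With a single geometric correspondence and no semantic support, a query collapses onto exactly the one key its correspondence allows, which is the consistency-enforcing behavior the abstract describes.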
| Original language | English |
|---|---|
| Number of pages | 15 |
| Journal | IEEE Transactions on Visualization and Computer Graphics |
| DOIs | |
| Publication status | E-pub ahead of print - 26 Jan 2026 |
Bibliographical note
Publisher Copyright: © 1995-2012 IEEE.
Keywords
- 3D Editing
- Gaussian Splatting
- Diffusion