Abstract
Specular highlights distort appearance, obscure texture, and hinder geometric reasoning in both natural and surgical imagery. We present UnReflectAnything, an RGB-only framework that removes highlights from a single image by predicting a highlight map together with a reflection-free diffuse reconstruction. The model uses a frozen vision transformer encoder to extract multi-scale features, a lightweight head to localize specular regions, and a token-level inpainting module that restores corrupted feature patches before producing the final diffuse image. To overcome the lack of paired supervision, we introduce a Virtual Highlight Synthesis pipeline that renders physically plausible specularities using monocular geometry, Fresnel-aware shading, and randomized lighting, which enables training on arbitrary RGB images with correct geometric structure. UnReflectAnything generalizes across natural and surgical domains, where non-Lambertian surfaces and non-uniform lighting create severe highlights, and achieves performance competitive with the state of the art on several benchmarks.
Physically-Grounded Specular Synthesis
A monocular geometry-aware pipeline based on MoGe-2 that renders Fresnel-modulated specularities from stochastic point-light sources. This enables robust supervision on unaligned RGB imagery by synthesizing physically plausible training pairs without requiring diffuse ground truth.
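The core of such a pipeline can be illustrated with a short numerical sketch. Assuming per-pixel unit normals and 3D positions from a monocular geometry model (MoGe-2 in the paper), a Blinn-Phong specular lobe modulated by Schlick's Fresnel approximation yields a plausible highlight map for one random point light. Function names, the `shininess` and `f0` values, and the compositing step are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def schlick_fresnel(cos_theta, f0=0.04):
    """Schlick's approximation of the Fresnel reflectance term."""
    return f0 + (1.0 - f0) * (1.0 - cos_theta) ** 5

def render_virtual_highlight(normals, points, light_pos, cam_pos,
                             shininess=64.0, intensity=1.0):
    """Render a Fresnel-modulated Blinn-Phong specular map for one point light.

    normals: (H, W, 3) unit surface normals (e.g. from monocular geometry)
    points:  (H, W, 3) 3D position of each pixel
    Returns a (H, W) highlight map in [0, 1].
    """
    # Unit vectors from each surface point to the light and to the camera.
    l = light_pos - points
    l /= np.linalg.norm(l, axis=-1, keepdims=True)
    v = cam_pos - points
    v /= np.linalg.norm(v, axis=-1, keepdims=True)
    # Blinn-Phong half vector.
    h = l + v
    h /= np.linalg.norm(h, axis=-1, keepdims=True)

    n_dot_h = np.clip((normals * h).sum(-1), 0.0, 1.0)
    v_dot_h = np.clip((v * h).sum(-1), 0.0, 1.0)

    spec = intensity * schlick_fresnel(v_dot_h) * n_dot_h ** shininess
    return np.clip(spec, 0.0, 1.0)
```

A training pair is then obtained by compositing the map onto a clean image, e.g. `corrupted = np.clip(img + spec[..., None], 0, 1)`, with the clean image serving as the diffuse target.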
Latent Token-Space Inpainting
A transformer-based architecture designed to reconstruct specularly-corrupted patch tokens directly within the DINOv3 latent space. By leveraging long-range dependencies, the model recovers original diffuse features while maintaining global semantic consistency.
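The mechanism can be sketched structurally in a few lines: corrupted patch tokens are replaced by a shared mask embedding, and a self-attention layer lets clean tokens propagate context into the masked positions. This is a minimal skeleton with a single random-weight attention layer, not the trained module from the paper; all names and dimensions here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def inpaint_tokens(tokens, corrupt_mask, rng):
    """Skeleton of token-space inpainting.

    tokens:       (N, D) patch tokens from a frozen encoder
    corrupt_mask: (N,) bool, True where a token is specularly corrupted
    Corrupted tokens are swapped for a shared mask embedding, then one
    self-attention layer (random weights here; learned in training)
    aggregates long-range context into the masked positions.
    """
    n, d = tokens.shape
    mask_token = rng.normal(size=d) * 0.02
    x = np.where(corrupt_mask[:, None], mask_token, tokens)

    # Single-head scaled dot-product self-attention over all tokens.
    Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    attn = softmax(q @ k.T / np.sqrt(d))
    out = attn @ v
    return x + out  # residual connection keeps clean tokens grounded
```

In training, the output tokens would be supervised against the encoder's features of the clean (highlight-free) image, so the attention weights learn to hallucinate plausible diffuse features at masked positions.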
Foundation Model Integration
Harnessing frozen DINOv3 Vision Transformer backbones to extract high-level semantic priors. This integration provides invariant features that ensure zero-shot generalization across diverse lighting conditions and non-Lambertian material properties.
Cross-Domain Robustness
Proven efficacy across disparate visual domains, from unstructured natural scenes to endoscopic surgical environments. The framework effectively suppresses highlights under complex light-matter interactions where conventional methods degrade.
Unified Supervision Framework
A training strategy combining synthetic highlight rendering with multi-scale loss functions. By integrating token-level and image-level supervision, the model ensures precise highlight localization and seamless diffuse reconstruction.
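As a concrete illustration, such an objective can be written as a weighted sum of an image-level reconstruction term, a token-level feature term, and a highlight-segmentation term. The specific losses (L1, L2, binary cross-entropy) and the weights below are plausible assumptions for this kind of framework, not the paper's reported values.

```python
import numpy as np

def unified_loss(pred_img, gt_img, pred_tokens, gt_tokens,
                 highlight_logits, highlight_gt,
                 w_img=1.0, w_tok=0.5, w_seg=1.0):
    """Sketch of a combined objective (illustrative weights and terms):
    - image-level L1 between predicted and clean diffuse images,
    - token-level L2 between inpainted and clean feature tokens,
    - binary cross-entropy on the predicted highlight map.
    """
    l_img = np.abs(pred_img - gt_img).mean()
    l_tok = ((pred_tokens - gt_tokens) ** 2).mean()

    # Sigmoid + BCE for highlight localization.
    p = 1.0 / (1.0 + np.exp(-highlight_logits))
    eps = 1e-7
    l_seg = -(highlight_gt * np.log(p + eps)
              + (1.0 - highlight_gt) * np.log(1.0 - p + eps)).mean()

    return w_img * l_img + w_tok * l_tok + w_seg * l_seg
```

Supervising at both the token level and the image level couples the latent inpainting module to the final decoder: feature-space errors are penalized before decoding, while the image-level term enforces seamless blending in pixel space.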
Citation
@misc{rota2025unreflectanything,
  title={UnReflectAnything: RGB-Only Highlight Removal by Rendering Synthetic Specular Supervision},
  author={Alberto Rota and Mert Kiray and Mert Asim Karaoglu and Patrick Ruhkamp and Elena De Momi and Nassir Navab and Benjamin Busam},
  year={2025},
  eprint={2512.09583},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2512.09583},
}