AI Achieves Rapid, Natural Language-Based 3D Object Recognition in Just Five Seconds
UNIST Introduces LightSplat, a Fast and Lightweight 3D Scene Understanding System—Accepted for CVPR 2026
Abstract

Open-vocabulary 3D scene understanding enables users to segment novel objects in complex 3D environments through natural language. However, existing approaches remain slow, memory-intensive, and overly complex due to iterative optimization and dense per-Gaussian feature assignments. To address this, we propose LightSplat, a fast and memory-efficient training-free framework that injects compact 2-byte semantic indices into 3D representations from multi-view images. By assigning semantic indices only to salient regions and managing them with a lightweight index-feature mapping, LightSplat eliminates costly feature optimization and storage overhead. We further ensure semantic consistency and efficient inference via single-step clustering that links geometrically and semantically related masks in 3D. We evaluate our method on LERF-OVS, ScanNet, and DL3DV-OVS across complex indoor-outdoor scenes. As a result, LightSplat achieves state-of-the-art performance with up to 50-400x speedup and 64x lower memory, enabling scalable language-driven 3D understanding.

A research team affiliated with UNIST has introduced a new AI system capable of instantly recognizing objects within 3D environments based solely on natural language descriptions. Whether in augmented reality interfaces or robotic systems, users can effortlessly specify target items, such as white sofas or tea in a glass, and receive precise localization within seconds. This approach is robust across diverse queries, from object details to complex spatial semantics, enabling fast and scalable performance for real-world applications.

Led by Professor Kyungdon Joo from the UNIST Graduate School of Artificial Intelligence, the team developed 'LightSplat,' a method that drastically reduces the memory and processing power required for open-vocabulary 3D scene understanding.
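
The index-feature mapping mentioned in the abstract can be pictured with a short sketch. The Python snippet below is an illustration under stated assumptions, not the team's implementation: the sizes, the random stand-in embeddings, the `UNLABELED` sentinel, and the `select_gaussians` helper are all hypothetical, and the byte counts it computes are not meant to reproduce the paper's figures. It shows only the general idea of storing one 2-byte index per 3D primitive and resolving a language query through a small lookup table.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical sizes, chosen only for illustration.
NUM_GAUSSIANS = 10_000   # 3D primitives in the scene
NUM_CLUSTERS = 5         # distinct semantic regions
FEATURE_DIM = 512        # length of each language-aligned embedding
UNLABELED = np.iinfo(np.uint16).max  # assumed sentinel for non-salient primitives

# One 2-byte index per primitive instead of one feature vector per primitive.
indices = np.full(NUM_GAUSSIANS, UNLABELED, dtype=np.uint16)
salient = rng.random(NUM_GAUSSIANS) < 0.3  # pretend 30% of primitives are salient
indices[salient] = rng.integers(0, NUM_CLUSTERS, size=int(salient.sum())).astype(np.uint16)

# Index-to-feature table: one unit-length embedding per cluster (random stand-ins).
table = rng.standard_normal((NUM_CLUSTERS, FEATURE_DIM)).astype(np.float32)
table /= np.linalg.norm(table, axis=1, keepdims=True)


def select_gaussians(query_embedding, indices, table):
    """Return a boolean mask over primitives whose cluster best matches the query."""
    q = query_embedding / np.linalg.norm(query_embedding)
    scores = table @ q                 # cosine similarity, one score per cluster
    best = int(np.argmax(scores))      # single best-matching cluster
    return indices == best


# Stand-in for a text embedding: a slightly noisy copy of cluster 2's feature.
query = table[2] + 0.01 * rng.standard_normal(FEATURE_DIM).astype(np.float32)
mask = select_gaussians(query, indices, table)

# Storage for the two layouts under these made-up sizes.
dense_bytes = NUM_GAUSSIANS * FEATURE_DIM * 4      # a float32 vector per primitive
compact_bytes = indices.nbytes + table.nbytes      # 2-byte indices plus the table
```

In this layout a query touches only the small table, and the per-primitive work reduces to an integer comparison. A 2-byte unsigned index can address up to 65,536 distinct entries, one of which is reserved here for the sentinel.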
Unlike previous systems that relied on complex, slow optimization processes, LightSplat assigns compact 2-byte labels to key regions in the scene. These labels enable rapid identification of objects based on natural language inputs.

Traditional recognition methods depend on predefined object categories, such as chairs or doors, which limits their flexibility. In contrast, LightSplat can understand and locate highly specific objects described in everyday language. This makes it well suited to applications requiring quick, intuitive interaction, including robotics, augmented reality, and digital twins.

By storing minimal semantic information linked to selected points within the scene, LightSplat reduces memory usage by 98%, roughly the 64-fold reduction reported in the abstract. Its streamlined approach allows the system to find and highlight objects by matching human language with 3D space in just five seconds, 50 to 400 times faster than previous techniques.

Testing on datasets such as LERF-OVS, DL3DV-OVS, and ScanNet demonstrated LightSplat's capability to identify objects of varying sizes and complexities, from small items like eggs and cups to large objects like cars and furniture. On the ScanNet dataset, it achieved a segmentation accuracy score of 37.11, confirming reliable recognition performance.

First author Jaehun Bang explained, "Speed, accuracy, and efficient memory use are all critical. Our technology makes open-vocabulary 3D recognition viable for real-world applications."

Professor Joo added, "This system could power robots that understand natural commands instantly, facilitate AR content creation through simple text-based object selection, and support digital twin environments that accurately and efficiently reflect real-world scenes."

The research has been accepted for presentation at the 2026 Conference on Computer Vision and Pattern Recognition (CVPR) in Denver, Colorado, from June 3 to 5, 2026.
The research was supported by the Institute of Information & Communications Technology Planning & Evaluation (IITP) and the UNIST Graduate School of Artificial Intelligence. Additional support came from initiatives including the AI Star Fellowship Program, the LG AI STAR Talent Development Program for Leading Large-Scale Generative AI Models in the Physical AI Domain, and the InnoCORE program of the Ministry of Science and ICT (MSIT).

Journal Reference
Jaehun Bang, Jinhyeok Kim, Minji Kim, et al., "LightSplat: Fast and Memory-Efficient Open-Vocabulary 3D Scene Understanding in Five Seconds," CVPR (2026).
- 2026-06-12
- JooHyeon Heo