WISE: A Multimodal Search Engine for Visual Scenes, Audio, Objects, Faces, Speech, and Metadata

June 2, 2026 · Prasanna Sridhar, Horace Lee, David M. S. Pinto, Andrew Zisserman, Abhishek Dutta

Abstract: This study introduces WISE, an open-source audiovisual search tool that consolidates diverse multimodal retrieval functionalities into a user-friendly platform designed for individuals without a background in machine learning. The system enables natural-language and reverse-image queries across both images and videos, operating at two distinct levels: scene-level searches (such as "empty street") and object-level searches (such as "horse"). Additionally, WISE offers face-based identification for specific persons, audio retrieval for acoustic events via text descriptions (e.g., "wood creak") or audio samples, and search capabilities within automatically transcribed speech. Users can also refine results using custom metadata.

Combining queries across modalities supports more specific searches. For instance, one can locate German trains in a historical archive by combining the object query "train" with the metadata query "Germany," or find a specific face within a particular location. Built on vector search, the engine scales to millions of images or thousands of hours of video footage. Its modular design allows new models to be integrated easily. WISE is suitable for local deployment, which keeps sensitive or proprietary collections private, and it has already been used in several practical applications. The project’s source code is publicly available at https://gitlab.com/vgg/wise/wise.
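
To illustrate how an object query such as "train" might be combined with a metadata query such as "Germany," the following is a minimal sketch that filters items by metadata and then ranks the survivors by cosine similarity to a query vector. It is not taken from the WISE code base: the `Item` class, the `search` function, the `country` metadata key, and the three-dimensional toy embeddings are all hypothetical, and the exhaustive scan stands in for whatever vector index the real system uses.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field


@dataclass
class Item:
    """Hypothetical archive entry: an id, an embedding, and user metadata."""
    item_id: str
    embedding: list[float]
    metadata: dict[str, str] = field(default_factory=dict)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two vectors; 0.0 if either has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def search(items, query_embedding, metadata_filter=None, top_k=3):
    """Keep items whose metadata matches every filter key, then rank by similarity.

    This is a brute-force scan for illustration only (an assumption, not the
    indexing strategy described in the paper).
    """
    metadata_filter = metadata_filter or {}
    candidates = [
        item for item in items
        if all(item.metadata.get(k) == v for k, v in metadata_filter.items())
    ]
    scored = [(cosine(item.embedding, query_embedding), item.item_id)
              for item in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]


# Toy data: in practice the embeddings would come from a vision-language model.
ARCHIVE = [
    Item("img_001", [0.9, 0.1, 0.0], {"country": "Germany"}),
    Item("img_002", [0.8, 0.2, 0.1], {"country": "France"}),
    Item("img_003", [0.1, 0.9, 0.2], {"country": "Germany"}),
]
TRAIN_QUERY = [1.0, 0.0, 0.0]  # stand-in for the embedding of the text "train"

results = search(ARCHIVE, TRAIN_QUERY, {"country": "Germany"}, top_k=2)
for score, item_id in results:
    print(f"{item_id}: {score:.3f}")
```

Filtering before scoring keeps the sketch short; a system operating at the scale described above would more plausibly push the metadata constraint into an approximate nearest-neighbour index rather than scan every item.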

Source: arXiv Generated at: 2026-06-02 00:00:00 UTC