Continual Visual and Verbal Learning Through a Child's Egocentric Input
Title: Fostering Continuous Visual and Verbal Acquisition via a Child’s Egocentric Perspective
Abstract
Children acquire vocabulary by interpreting a continuous, temporally ordered flow of egocentric experiences. While recent studies have demonstrated that neural networks can successfully learn word-referent associations from a child’s egocentric video footage, these models typically require cycling through shuffled data for hundreds of epochs. This approach diverges significantly from the natural, sequential manner in which children encounter their surroundings. To address this, we present BabyCL, a novel continual multimodal learning framework designed to process the SAYCam dataset in a single, chronological pass. BabyCL integrates streaming visual representation learning with an image-text contrastive objective. The architecture employs multi-stage temporal segmentation of the data stream alongside a dual replay buffer that separately maintains visual and multimodal histories. Training is conducted jointly using three contrastive losses on a shared backbone. Evaluated under an optimized budget, BabyCL surpasses streaming learning baselines on the SAYCam Labeled-S 4AFC benchmark, significantly reducing the performance gap relative to the upper bound achieved by offline training. Ablation studies confirm that these improvements remain robust regardless of the online temporal segmentation window size or the replay buffer’s eviction policy. Collectively, these findings indicate that meaningful word-referent mappings can develop under training conditions that closely mirror a child’s real-world experience.
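To make the dual replay buffer concrete, the following is a minimal sketch and not the authors' implementation. The abstract states only that visual and multimodal histories are kept separately. The class names, the capacities, and the use of reservoir sampling as the eviction policy are assumptions made for illustration. Frames and utterances are stand-in strings.

```python
import random
from typing import Any, List, Optional, Tuple


class ReservoirBuffer:
    """Fixed-capacity buffer; reservoir sampling is an assumed eviction policy."""

    def __init__(self, capacity: int, seed: Optional[int] = None) -> None:
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        self.capacity = capacity
        self.items: List[Any] = []
        self.seen = 0  # total number of items offered so far
        self._rng = random.Random(seed)

    def add(self, item: Any) -> None:
        self.seen += 1
        if len(self.items) < self.capacity:
            self.items.append(item)
            return
        # Once full, keep the new item with probability capacity / seen,
        # overwriting a uniformly chosen stored item.
        slot = self._rng.randrange(self.seen)
        if slot < self.capacity:
            self.items[slot] = item

    def sample(self, k: int) -> List[Any]:
        # Draw up to k stored items without replacement.
        return self._rng.sample(self.items, min(k, len(self.items)))


class DualReplayBuffer:
    """Hypothetical wrapper: one history for frames, one for frame-text pairs."""

    def __init__(self, visual_capacity: int, multimodal_capacity: int,
                 seed: int = 0) -> None:
        self.visual = ReservoirBuffer(visual_capacity, seed)
        self.multimodal = ReservoirBuffer(multimodal_capacity, seed + 1)

    def observe(self, frame: Any, utterance: Optional[str] = None) -> None:
        # Every frame enters the visual history; only frames that co-occur
        # with an utterance enter the multimodal history.
        self.visual.add(frame)
        if utterance is not None:
            self.multimodal.add((frame, utterance))

    def replay(self, n_visual: int,
               n_multimodal: int) -> Tuple[List[Any], List[Any]]:
        return self.visual.sample(n_visual), self.multimodal.sample(n_multimodal)


# Single chronological pass over a toy stream of 10 frames,
# where every third frame is paired with an utterance.
buffer = DualReplayBuffer(visual_capacity=4, multimodal_capacity=2, seed=0)
for t in range(10):
    utterance = f"word_{t}" if t % 3 == 0 else None
    buffer.observe(f"frame_{t}", utterance)

frames, pairs = buffer.replay(n_visual=3, n_multimodal=2)
```

In a training loop of this shape, the replayed frames would feed the visual objective and the replayed pairs would feed the image-text objective, alongside the current segment of the stream. Reservoir sampling keeps a uniform sample of everything seen so far; other eviction policies, such as first-in-first-out, would fit the same interface.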
Source: arXiv
Generated at: 2026-06-04 00:00:00 UTC




