arXiv

Instant-Fold: In-Context Imitation Learning for Deformable Object Manipulation

June 4, 2026 · Yilong Wang, Cheng Qian, Edward Johns

Title: Instant-Fold: Leveraging In-Context Imitation Learning for Manipulating Deformable Objects

Abstract: Deformable object manipulation (DOM) is difficult because its states are high-dimensional and partially observable and evolve over long horizons through topology-altering interactions, and because tasks often admit multiple valid manipulation strategies. To address these challenges, we present Instant-Fold, an in-context imitation learning framework for DOM tasks. A single human demonstration is enough for the policy to infer and perform a variety of manipulation modes, ranging from different spatial executions to varied action orderings, without gradient-based updates. The system first learns deformation-aware visual representations through temporal contrastive pretraining. A flow-matching transformer, conditioned on the provided demonstration, then predicts the actions needed to carry out the specified manipulation mode. Trained exclusively in simulation, Instant-Fold generalizes robustly across various folding patterns and transfers zero-shot to real-world environments, with no further data collection or fine-tuning. For visual examples, please visit https://instant-fold.github.io.
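The abstract does not specify the probability path, network, or sampler behind the flow-matching policy. As a rough illustration only, the sketch below shows generic flow matching with a straight-line path between a noise sample and a target action, plus Euler integration at inference time. Every name here (`interpolate`, `flow_matching_loss`, `sample_action`, `oracle_velocity`, the `demo_action` context key) is hypothetical, and the closed-form `oracle_velocity` merely stands in for a trained, demonstration-conditioned transformer.

```python
import random


def interpolate(x0, x1, t):
    """Point on the straight-line path from noise x0 to target action x1 at time t."""
    return [(1.0 - t) * a + t * b for a, b in zip(x0, x1)]


def target_velocity(x0, x1):
    """Velocity of the straight-line path; this is the regression target."""
    return [b - a for a, b in zip(x0, x1)]


def flow_matching_loss(velocity_fn, x0, x1, t, context):
    """Mean squared error between predicted and target velocity at one time t."""
    x_t = interpolate(x0, x1, t)
    pred = velocity_fn(x_t, t, context)
    target = target_velocity(x0, x1)
    return sum((p - q) ** 2 for p, q in zip(pred, target)) / len(target)


def sample_action(velocity_fn, x0, context, num_steps=10):
    """Euler-integrate dx/dt = v(x, t, context) from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1, so the oracle never divides by zero
        v = velocity_fn(x, t, context)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


def oracle_velocity(x, t, context):
    """Hypothetical stand-in for a trained network (assumption, not the paper's model).

    For a single known target action the ideal velocity is (target - x) / (1 - t).
    """
    target = context["demo_action"]
    return [(b - a) / (1.0 - t) for a, b in zip(x, target)]


# Usage: start from Gaussian noise and integrate toward the action implied by the context.
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(3)]
context = {"demo_action": [0.2, -0.1, 0.5]}  # toy 3-D action, purely illustrative
action = sample_action(oracle_velocity, noise, context, num_steps=8)
print(action)
```

In a learned policy the oracle would be replaced by a network trained to minimize `flow_matching_loss` over sampled noise, times, and demonstration contexts. The context is what would let one set of weights produce different manipulation modes without gradient updates.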

Source: arXiv