Robotics 43
☆ Dropping the D: RGB-D SLAM Without the Depth Sensor
We present DropD-SLAM, a real-time monocular SLAM system that achieves
RGB-D-level accuracy without relying on depth sensors. The system replaces
active depth input with three pretrained vision modules: a monocular metric
depth estimator, a learned keypoint detector, and an instance segmentation
network. Dynamic objects are suppressed using dilated instance masks, while
static keypoints are assigned predicted depth values and backprojected into 3D
to form metrically scaled features. These are processed by an unmodified RGB-D
SLAM back end for tracking and mapping. On the TUM RGB-D benchmark, DropD-SLAM
attains 7.4 cm mean ATE on static sequences and 1.8 cm on dynamic sequences,
matching or surpassing state-of-the-art RGB-D methods while operating at 22 FPS
on a single GPU. These results suggest that modern pretrained vision models can
replace active depth sensors as reliable, real-time sources of metric scale,
marking a step toward simpler and more cost-effective SLAM systems.
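As a concrete illustration of the backprojection step described above (a minimal sketch assuming a standard pinhole model; the intrinsics and depth map are hypothetical inputs, not the authors' code):

    import numpy as np

    def backproject_keypoints(keypoints_uv, depth_map, fx, fy, cx, cy):
        """Assign predicted metric depth to 2D keypoints and lift them to 3D."""
        points_3d = []
        for u, v in keypoints_uv:
            z = depth_map[int(v), int(u)]  # predicted metric depth in meters
            x = (u - cx) * z / fx          # inverse pinhole projection
            y = (v - cy) * z / fy
            points_3d.append((x, y, z))
        return np.array(points_3d)         # metrically scaled features for the back end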
★ EmbodiedCoder: Parameterized Embodied Mobile Manipulation via Modern Coding Model
Zefu Lin, Rongxu Cui, Chen Hanning, Xiangyu Wang, Junjia Xu, Xiaojuan Jin, Chen Wenbo, Hui Zhou, Lue Fan, Wenling Li, Zhaoxiang Zhang
Recent advances in robot control methods, from end-to-end
vision-language-action frameworks to modular systems with predefined
primitives, have advanced robots' ability to follow natural language
instructions. Nonetheless, many approaches still struggle to scale to diverse
environments, as they often rely on large annotated datasets and offer limited
interpretability. In this work, we introduce EmbodiedCoder, a training-free
framework for open-world mobile robot manipulation that leverages coding models
to directly generate executable robot trajectories. By grounding high-level
instructions in code, EmbodiedCoder enables flexible object geometry
parameterization and manipulation trajectory synthesis without additional data
collection or fine-tuning. This coding-based paradigm provides a transparent and
generalizable way to connect perception with manipulation. Experiments on real
mobile robots show that EmbodiedCoder achieves robust performance across
diverse long-term tasks and generalizes effectively to novel objects and
environments. Our results demonstrate an interpretable approach for bridging
high-level reasoning and low-level control, moving beyond fixed primitives
toward versatile robot intelligence. See the project page at:
https://anonymous.4open.science/w/Embodied-Coder/
comment: Demo Page: https://anonymous.4open.science/w/Embodied-Coder/
☆ DYMO-Hair: Generalizable Volumetric Dynamics Modeling for Robot Hair Manipulation
Chengyang Zhao, Uksang Yoo, Arkadeep Narayan Chaudhury, Giljoo Nam, Jonathan Francis, Jeffrey Ichnowski, Jean Oh
Hair care is an essential daily activity, yet it remains inaccessible to
individuals with limited mobility and challenging for autonomous robot systems
due to the fine-grained physical structure and complex dynamics of hair. In
this work, we present DYMO-Hair, a model-based robot hair care system. We
introduce a novel dynamics learning paradigm that is suited for volumetric
quantities such as hair, relying on an action-conditioned latent state editing
mechanism, coupled with a compact 3D latent space of diverse hairstyles to
improve generalizability. This latent space is pre-trained at scale using a
novel hair physics simulator, enabling generalization across previously unseen
hairstyles. Using the dynamics model with a Model Predictive Path Integral
(MPPI) planner, DYMO-Hair is able to perform visual goal-conditioned hair
styling. Experiments in simulation demonstrate that DYMO-Hair's dynamics model
outperforms baselines on capturing local deformation for diverse, unseen
hairstyles. DYMO-Hair further outperforms baselines in closed-loop hair styling
tasks on unseen hairstyles, with an average of 22% lower final geometric error
and 42% higher success rate than the state-of-the-art system. Real-world
experiments exhibit zero-shot transferability of our system to wigs, achieving
consistent success on challenging unseen hairstyles where the state-of-the-art
system fails. Together, these results introduce a foundation for model-based
robot hair care, advancing toward more generalizable, flexible, and accessible
robot hair styling in unconstrained physical environments. More details are
available on our project page: https://chengyzhao.github.io/DYMOHair-web/.
comment: Project page: https://chengyzhao.github.io/DYMOHair-web/
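For readers unfamiliar with MPPI, a generic form of the planner update is sketched below (the standard algorithm the paper couples with its learned dynamics; rollout_cost, which would roll the latent dynamics model toward the goal, is a placeholder):

    import numpy as np

    def mppi_step(u_nominal, rollout_cost, num_samples=64, noise_std=0.1,
                  temperature=1.0, rng=np.random.default_rng(0)):
        """One MPPI iteration over an action sequence u_nominal of shape (H, dim)."""
        noise = rng.normal(0.0, noise_std, size=(num_samples,) + u_nominal.shape)
        costs = np.array([rollout_cost(u_nominal + eps) for eps in noise])
        weights = np.exp(-(costs - costs.min()) / temperature)  # soft-min over costs
        weights /= weights.sum()
        return u_nominal + np.tensordot(weights, noise, axes=1)  # cost-weighted update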
☆ A Preview of HoloOcean 2.0 ICRA 2025
Marine robotics simulators play a fundamental role in the development of
marine robotic systems. With increased focus on the marine robotics field in
recent years, there has been significant interest in developing higher
fidelity simulation of marine sensors, physics, and visual rendering
capabilities to support autonomous marine robot development and validation.
HoloOcean 2.0, the next major release of HoloOcean, brings state-of-the-art
features under a general marine simulator capable of supporting a variety of
tasks. New features in HoloOcean 2.0 include migration to Unreal Engine (UE)
5.3, advanced vehicle dynamics using models from Fossen, and support for ROS2
using a custom bridge. Additional features are currently in development,
including significantly more efficient ray tracing-based sidescan,
forward-looking, and bathymetric sonar implementations; semantic sensors;
environment generation tools; volumetric environmental effects; and realistic
waves.
comment: 5 pages, 9 figures, submitted to the ICRA 2025 aq2uasim workshop
☆ Vision-Guided Targeted Grasping and Vibration for Robotic Pollination in Controlled Environments
Jaehwan Jeong, Tuan-Anh Vu, Radha Lahoti, Jiawen Wang, Vivek Alumootil, Sangpil Kim, M. Khalid Jawed
Robotic pollination offers a promising alternative to manual labor and
bumblebee-assisted methods in controlled agriculture, where wind-driven
pollination is absent and regulatory restrictions limit the use of commercial
pollinators. In this work, we present and validate a vision-guided robotic
framework that uses data from an end-effector mounted RGB-D sensor and combines
3D plant reconstruction, targeted grasp planning, and physics-based vibration
modeling to enable precise pollination. First, the plant is reconstructed in 3D
and registered to the robot coordinate frame to identify obstacle-free grasp
poses along the main stem. Second, a discrete elastic rod model predicts the
relationship between actuation parameters and flower dynamics, guiding the
selection of optimal pollination strategies. Finally, a manipulator with soft
grippers grasps the stem and applies controlled vibrations to induce pollen
release. End-to-end experiments demonstrate a 92.5% main-stem grasping success
rate, and simulation-guided optimization of vibration parameters further
validates the feasibility of our approach, ensuring that the robot can safely
and effectively perform pollination without damaging the flower. To our
knowledge, this is the first robotic system to jointly integrate vision-based
grasping and vibration modeling for automated precision pollination.
☆ Towards Autonomous Tape Handling for Robotic Wound Redressing
Chronic wounds, such as diabetic, pressure, and venous ulcers, affect over
6.5 million patients in the United States alone and generate an annual cost
exceeding $25 billion. Despite this burden, chronic wound care remains a
routine yet manual process performed exclusively by trained clinicians due to
its critical safety demands. We envision a future in which robotics and
automation support wound care to lower costs and enhance patient outcomes. This
paper introduces an autonomous framework for one of the most fundamental yet
challenging subtasks in wound redressing: adhesive tape manipulation.
Specifically, we address two critical capabilities: tape initial detachment
(TID) and secure tape placement. To handle the complex adhesive dynamics of
detachment, we propose a force-feedback imitation learning approach trained
from human teleoperation demonstrations. For tape placement, we develop a
numerical trajectory optimization method that ensures smooth adhesion and
wrinkle-free application across diverse anatomical surfaces. We validate these
methods through extensive experiments, demonstrating reliable performance in
both quantitative evaluations and integrated wound redressing pipelines. Our
results establish tape manipulation as an essential step toward practical
robotic wound care automation.
☆ Multi-Robot Distributed Optimization for Exploration and Mapping of Unknown Environments using Bioinspired Tactile-Sensor
This project proposes a bioinspired multi-robot system using Distributed
Optimization for efficient exploration and mapping of unknown environments.
Each robot explores its surroundings and builds a local map; these maps are then
merged into a global 2D map of the environment. Inspired by wall-following
behaviors, each robot autonomously explores its neighborhood based on a tactile
sensor, similar to the antenna of a cockroach, mounted on the surface of the
robot. Instead of avoiding obstacles, robots log collision points when they
touch obstacles. This decentralized control strategy ensures effective task
allocation and efficient exploration of unknown terrains, with applications in
search and rescue, industrial inspection, and environmental monitoring. The
approach was validated through experiments using e-puck robots in a simulated
1.5 x 1.5 m environment with three obstacles. The results demonstrated the
system's effectiveness in achieving high coverage, minimizing collisions, and
constructing accurate 2D maps.
☆ Cross-Embodiment Dexterous Hand Articulation Generation via Morphology-Aware Learning
Dexterous grasping with multi-fingered hands remains challenging due to
high-dimensional articulations and the cost of optimization-based pipelines.
Existing end-to-end methods require training on large-scale datasets for
specific hands, limiting their ability to generalize across different
embodiments. We propose an eigengrasp-based, end-to-end framework for
cross-embodiment grasp generation. From a hand's morphology description, we
derive a morphology embedding and an eigengrasp set. Conditioned on these,
together with the object point cloud and wrist pose, an amplitude predictor
regresses articulation coefficients in a low-dimensional space, which are
decoded into full joint articulations. Articulation learning is supervised with
a Kinematic-Aware Articulation Loss (KAL) that emphasizes fingertip-relevant
motions and injects morphology-specific structure. In simulation on unseen
objects across three dexterous hands, our model attains a 91.9% average grasp
success rate with an inference time below 0.4 seconds per grasp. With few-shot
adaptation to an unseen hand, it achieves 85.6% success on unseen objects in
simulation, and real-world experiments on this few-shot generalized hand
achieve an 87% success rate. The code and additional materials will be made
available upon publication on our project website
https://connor-zh.github.io/cross_embodiment_dexterous_grasping.
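The low-dimensional decoding step can be pictured as follows (an illustrative sketch; eigengrasp_basis and mean_grasp are hypothetical names for the per-hand quantities derived from the morphology description, not the authors' API):

    import numpy as np

    def decode_articulation(amplitudes, eigengrasp_basis, mean_grasp):
        """amplitudes: (k,); eigengrasp_basis: (k, n_joints); mean_grasp: (n_joints,)."""
        return mean_grasp + amplitudes @ eigengrasp_basis  # full joint articulation

    # e.g., k = 5 eigengrasps controlling a hypothetical 16-DoF hand
    q = decode_articulation(np.zeros(5), np.random.randn(5, 16), np.zeros(16))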
☆ Hybrid Quantum-Classical Policy Gradient for Adaptive Control of Cyber-Physical Systems: A Comparative Study of VQC vs. MLP
The comparative evaluation between classical and quantum reinforcement
learning (QRL) paradigms was conducted to investigate their convergence
behavior, robustness under observational noise, and computational efficiency in
a benchmark control environment. The study employed a multilayer perceptron
(MLP) agent as a classical baseline and a parameterized variational quantum
circuit (VQC) as a quantum counterpart, both trained on the CartPole-v1
environment over 500 episodes. Empirical results demonstrated that the
classical MLP achieved near-optimal policy convergence with a mean return of
498.7 +/- 3.2, maintaining stable equilibrium throughout training. In contrast,
the VQC exhibited limited learning capability, with an average return of 14.6
+/- 4.8, primarily constrained by circuit depth and qubit connectivity. Noise
robustness analysis further revealed that the MLP policy deteriorated
gracefully under Gaussian perturbations, while the VQC displayed higher
sensitivity at equivalent noise levels. Despite the lower asymptotic
performance, the VQC exhibited significantly lower parameter count and
marginally increased training time, highlighting its potential scalability for
low-resource quantum processors. The results suggest that while classical
neural policies remain dominant in current control benchmarks, quantum-enhanced
architectures could offer promising efficiency advantages once hardware noise
and expressivity limitations are mitigated.
comment: 6 pages, 5 figures, 2 tables, 17 equations, 1 algorithm
☆ Information-Theoretic Policy Pre-Training with Empowerment
Moritz Schneider, Robert Krug, Narunas Vaskevicius, Luigi Palmieri, Michael Volpp, Joschka Boedecker
Empowerment, an information-theoretic measure of an agent's potential
influence on its environment, has emerged as a powerful intrinsic motivation
and exploration framework for reinforcement learning (RL). Beyond its use in
unsupervised RL and skill-learning algorithms, the specific use of empowerment
as a pre-training signal has received limited attention in the literature. We
show that empowerment can be used as a pre-training signal for data-efficient
downstream task adaptation. For this we extend the traditional notion of
empowerment by introducing discounted empowerment, which balances the agent's
control over the environment across short- and long-term horizons. Leveraging
this formulation, we propose a novel pre-training paradigm that initializes
policies to maximize discounted empowerment, enabling agents to acquire a
robust understanding of environmental dynamics. We analyze empowerment-based
pre-training for various existing RL algorithms and empirically demonstrate its
potential as a general-purpose initialization strategy: empowerment-maximizing
policies with long horizons are data-efficient and effective, leading to
improved adaptability in downstream tasks. Our findings pave the way for future
research to scale this framework to high-dimensional and complex tasks, further
advancing the field of RL.
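For reference, the classical notion that the discounted variant generalizes is the channel capacity from an action sequence to the ensuing state, $\mathcal{E}(s) = \max_{p(a_{1:T})} I(A_{1:T};\, S_{T+1} \mid S_1 = s)$ (a textbook definition restated here for context; the paper's discounted formulation is not reproduced).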
☆ Coordinate-Consistent Localization via Continuous-Time Calibration and Fusion of UWB and SLAM Observations
Onboard simultaneous localization and mapping (SLAM) methods are commonly
used to provide accurate localization information for autonomous robots.
However, the coordinate origin of the SLAM estimate often resets for each run. On
the other hand, UWB-based localization with fixed anchors can ensure a
consistent coordinate reference across sessions; however, it requires an
accurate assignment of the anchor nodes' coordinates. To this end, we propose a
two-stage approach that calibrates and fuses UWB data and SLAM data to achieve
coordinate-wise consistent and accurate localization in the same environment.
In the first stage, we solve a continuous-time batch optimization problem by
using the range and odometry data from one full run, incorporating height
priors and anchor-to-anchor distance factors to recover the anchors' 3D
positions. For the subsequent runs in the second stage, a sliding-window
optimization scheme fuses the UWB and SLAM data, which facilitates accurate
localization in the same coordinate system. Experiments are carried out on the
NTU VIRAL dataset with six scenarios of UAV flight, and we show that
calibration using data in one run is sufficient to enable accurate localization
in the remaining runs. We release our source code to benefit the community at
https://github.com/ntdathp/slam-uwb-calibration.
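As a toy version of the first-stage idea (our simplification; the paper's continuous-time batch formulation with height priors and anchor-to-anchor factors is richer), a single anchor's position can be recovered from ranges gathered along a SLAM-estimated trajectory by nonlinear least squares:

    import numpy as np
    from scipy.optimize import least_squares

    def calibrate_anchor(robot_positions, measured_ranges, x0=np.zeros(3)):
        """robot_positions: (N, 3) SLAM poses; measured_ranges: (N,) UWB ranges."""
        def residuals(anchor):
            return np.linalg.norm(robot_positions - anchor, axis=1) - measured_ranges
        return least_squares(residuals, x0).x  # estimated anchor 3D position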
☆ AI-Enabled Capabilities to Facilitate Next-Generation Rover Surface Operations
Current planetary rovers operate at traverse speeds of approximately 10 cm/s,
fundamentally limiting exploration efficiency. This work presents integrated AI
systems which significantly improve autonomy through three components: (i) the
FASTNAV Far Obstacle Detector (FOD), capable of facilitating sustained 1.0 m/s
speeds via computer vision-based obstacle detection; (ii) CISRU, a multi-robot
coordination framework enabling human-robot collaboration for in-situ resource
utilisation; and (iii) the ViBEKO and AIAXR deep learning-based terrain
classification studies. Field validation in Mars analogue environments
demonstrated these systems at Technology Readiness Level 4, providing
measurable improvements in traverse speed, classification accuracy, and
operational safety for next-generation planetary missions.
comment: Paper for 18th Symposium on Advanced Space Technologies in Robotics
and Automation (ASTRA), presented on October 7th in Leiden, Netherlands
☆ The DISTANT Design for Remote Transmission and Steering Systems for Planetary Robotics
Cristina Luna, Alba Guerra, Almudena Moreno, Manuel Esquer, Willy Roa, Mateusz Krawczak, Robert Popela, Piotr Osica, Davide Nicolis
Planetary exploration missions require robust locomotion systems capable of
operating in extreme environments over extended periods. This paper presents
the DISTANT (Distant Transmission and Steering Systems) design, a novel
approach for relocating rover traction and steering actuators from
wheel-mounted positions to a thermally protected warm box within the rover
body. The design addresses critical challenges in long-distance traversal
missions by protecting sensitive components from thermal cycling, dust
contamination, and mechanical wear. A double wishbone suspension configuration
with cardan joints and capstan drive steering has been selected as the optimal
architecture following comprehensive trade-off analysis. The system enables
independent wheel traction, steering control, and suspension management whilst
maintaining all motorisation within the protected environment. The design meets
a 50 km traverse requirement without performance degradation, with integrated
dust protection mechanisms and thermal management solutions. Testing and
validation activities are planned for Q1 2026 following breadboard
manufacturing at 1:3 scale.
comment: Paper for 18th Symposium on Advanced Space Technologies in Robotics
and Automation (ASTRA), presented on October 7th in Leiden, Netherlands
☆ Learning to Crawl: Latent Model-Based Reinforcement Learning for Soft Robotic Adaptive Locomotion
Soft robotic crawlers are mobile robots that utilize soft body deformability
and compliance to achieve locomotion through surface contact. Designing control
strategies for such systems is challenging due to model inaccuracies, sensor
noise, and the need to discover locomotor gaits. In this work, we present a
model-based reinforcement learning (MB-RL) framework in which latent dynamics
inferred from onboard sensors serve as a predictive model that guides an
actor-critic algorithm to optimize locomotor policies. We evaluate the
framework on a minimal crawler model in simulation using inertial measurement
units and time-of-flight sensors as observations. The learned latent dynamics
enable short-horizon motion prediction while the actor-critic discovers
effective locomotor policies. This approach highlights the potential of
latent-dynamics MB-RL for enabling embodied soft robotic adaptive locomotion
based solely on noisy sensor feedback.
☆ A Co-Design Framework for Energy-Aware Monoped Jumping with Detailed Actuator Modeling
A monoped's jump height and energy consumption depend on both its mechanical
design and its control strategy. Existing co-design frameworks typically optimize
for either maximum height or minimum energy, neglecting their trade-off. They
also often omit gearbox parameter optimization and use oversimplified actuator
mass models, producing designs difficult to replicate in practice. In this
work, we introduce a novel three-stage co-design optimization framework that
jointly maximizes jump height while minimizing mechanical energy consumption of
a monoped. The proposed method explicitly incorporates realistic actuator mass
models and optimizes mechanical design (including gearbox) and control
parameters within a unified framework. The resulting design outputs are then
used to automatically generate a parameterized CAD model suitable for direct
fabrication, significantly reducing manual design iterations. Our experimental
evaluations demonstrate a 50 percent reduction in mechanical energy consumption
compared to the baseline design, while achieving a jump height of 0.8m. Video
presentation is available at http://y2u.be/XW8IFRCcPgM
comment: 7 pages, 8 figures, 1 table, Accepted at IEEE-RAS 24th International
Conference on Humanoid Robots (Humanoids) 2025, Aman Singh, Aastha Mishra -
Authors contributed equally
★ The Safety Challenge of World Models for Embodied AI Agents: A Review
Lorenzo Baraldi, Zifan Zeng, Chongzhe Zhang, Aradhana Nayak, Hongbo Zhu, Feng Liu, Qunli Zhang, Peng Wang, Shiming Liu, Zheng Hu, Angelo Cangelosi, Lorenzo Baraldi
The rapid progress in embodied artificial intelligence has highlighted the
necessity for more advanced and integrated models that can perceive, interpret,
and predict environmental dynamics. In this context, World Models (WMs) have
been introduced to provide embodied agents with the abilities to anticipate
future environmental states and fill in knowledge gaps, thereby enhancing
agents' ability to plan and execute actions. However, when dealing with
embodied agents, it is fundamental to ensure that predictions are safe for both
the agent and the environment. In this article, we conduct a comprehensive
literature review of World Models in the domains of autonomous driving and
robotics, with a specific focus on the safety implications of scene and control
generation tasks. Our review is complemented by an empirical analysis, wherein
we collect and examine predictions from state-of-the-art models, identify and
categorize common faults (herein referred to as pathologies), and provide a
quantitative evaluation of the results.
☆ VCoT-Grasp: Grasp Foundation Models with Visual Chain-of-Thought Reasoning for Language-driven Grasp Generation
Haoran Zhang, Shuanghao Bai, Wanqi Zhou, Yuedi Zhang, Qi Zhang, Pengxiang Ding, Cheng Chi, Donglin Wang, Badong Chen
Robotic grasping is one of the most fundamental tasks in robotic
manipulation, and grasp detection/generation has long been the subject of
extensive research. Recently, language-driven grasp generation has emerged as a
promising direction due to its practical interaction capabilities. However,
most existing approaches either lack sufficient reasoning and generalization
capabilities or depend on complex modular pipelines. Moreover, current grasp
foundation models tend to overemphasize dialog and object semantics, resulting
in inferior performance and restriction to single-object grasping. To maintain
strong reasoning ability and generalization in cluttered environments, we
propose VCoT-Grasp, an end-to-end grasp foundation model that incorporates
visual chain-of-thought reasoning to enhance visual understanding for grasp
generation. VCoT-Grasp adopts a multi-turn processing paradigm that dynamically
focuses on visual inputs while providing interpretable reasoning traces. For
training, we refine and introduce a large-scale dataset, VCoT-GraspSet,
comprising 167K synthetic images with over 1.36M grasps, as well as 400+
real-world images with more than 1.2K grasps, annotated with intermediate
bounding boxes. Extensive experiments on both VCoT-GraspSet and a real robot
demonstrate that our method significantly improves grasp success rates and
generalizes effectively to unseen objects, backgrounds, and distractors. More
details can be found at https://zhanghr2001.github.io/VCoT-Grasp.github.io.
☆ Human-in-the-loop Optimisation in Robot-assisted Gait Training
Wearable robots offer a promising solution for quantitatively monitoring gait
and providing systematic, adaptive assistance to promote patient independence
and improve gait. However, due to significant interpersonal and intrapersonal
variability in walking patterns, it is important to design robot controllers
that can adapt to the unique characteristics of each individual. This paper
investigates the potential of human-in-the-loop optimisation (HILO) to deliver
personalised assistance in gait training. The Covariance Matrix Adaptation
Evolution Strategy (CMA-ES) was employed to continuously optimise an
assist-as-needed controller of a lower-limb exoskeleton. Six healthy
individuals participated in a two-day experiment. Our results suggest that
while the CMA-ES appears to converge to a unique set of stiffnesses for each
individual, no measurable impact on the subjects' performance was observed
during the validation trials. These findings highlight the impact of
human-robot co-adaptation and human behaviour variability, whose effect may be
greater than potential benefits of personalising rule-based assistive
controllers. Our work contributes to understanding the limitations of current
personalisation approaches in exoskeleton-assisted gait rehabilitation and
identifies key challenges for effective implementation of human-in-the-loop
optimisation in this domain.
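A minimal sketch of such a HILO loop using the off-the-shelf cma package (pip install cma); the cost function standing in for a human trial is hypothetical, and a real evaluation would measure gait metrics from the exoskeleton:

    import cma

    def optimise_stiffness(evaluate_gait, x0, sigma0=0.3, n_generations=20):
        """evaluate_gait maps a stiffness vector to a scalar cost from one trial."""
        es = cma.CMAEvolutionStrategy(x0, sigma0)
        for _ in range(n_generations):
            candidates = es.ask()  # stiffness sets to test on the subject
            es.tell(candidates, [evaluate_gait(x) for x in candidates])
        return es.result.xbest

    # Stand-in quadratic cost, for illustration only.
    best = optimise_stiffness(lambda x: sum(v * v for v in x), x0=[0.5] * 4)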
☆ Precise and Efficient Collision Prediction under Uncertainty in Autonomous Driving ICRA 2026
This research introduces two efficient methods to estimate the collision risk
of planned trajectories in autonomous driving under uncertain driving
conditions. Deterministic collision checks of planned trajectories are often
inaccurate or overly conservative, as noisy perception, localization errors,
and uncertain predictions of other traffic participants introduce significant
uncertainty into the planning process. This paper presents two semi-analytic
methods to compute the collision probability of planned trajectories with
arbitrary convex obstacles. The first approach evaluates the probability of
spatial overlap between an autonomous vehicle and surrounding obstacles, while
the second estimates the collision probability based on stochastic boundary
crossings. Both formulations incorporate full state uncertainties, including
position, orientation, and velocity, and achieve high accuracy at computational
costs suitable for real-time planning. Simulation studies verify that the
proposed methods closely match Monte Carlo results while providing significant
runtime advantages, enabling their use in risk-aware trajectory planning. The
collision estimation methods are available as open-source software:
https://github.com/TUM-AVS/Collision-Probability-Estimation
comment: 8 pages, submitted to the IEEE ICRA 2026, Vienna, Austria
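For contrast, the Monte Carlo baseline that the semi-analytic methods are verified against can be written in a few lines (a sketch; the geometric check in_collision is a placeholder):

    import numpy as np

    def mc_collision_probability(mean_state, cov, in_collision, n=10_000,
                                 rng=np.random.default_rng(0)):
        """Estimate P(collision) by sampling states from N(mean_state, cov)."""
        samples = rng.multivariate_normal(mean_state, cov, size=n)
        return np.mean([in_collision(s) for s in samples])

    # Example: 2D position uncertainty against a unit-disc obstacle at the origin.
    p = mc_collision_probability(np.array([2.0, 0.0]), 0.25 * np.eye(2),
                                 lambda s: np.linalg.norm(s) < 1.0)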
☆ Federated Split Learning for Resource-Constrained Robots in Industrial IoT: Framework Comparison, Optimization Strategies, and Future Directions
Federated split learning (FedSL) has emerged as a promising paradigm for
enabling collaborative intelligence in industrial Internet of Things (IoT)
systems, particularly in smart factories where data privacy, communication
efficiency, and device heterogeneity are critical concerns. In this article, we
present a comprehensive study of FedSL frameworks tailored for
resource-constrained robots in industrial scenarios. We compare synchronous,
asynchronous, hierarchical, and heterogeneous FedSL frameworks in terms of
workflow, scalability, adaptability, and limitations under dynamic industrial
conditions. Furthermore, we systematically categorize token fusion strategies
into three paradigms: input-level (pre-fusion), intermediate-level
(intra-fusion), and output-level (post-fusion), and summarize their respective
strengths in industrial applications. We also provide adaptive optimization
techniques to enhance the efficiency and feasibility of FedSL implementation,
including model compression, split layer selection, computing frequency
allocation, and wireless resource management. Simulation results validate the
performance of these frameworks under industrial detection scenarios. Finally,
we outline open issues and research directions of FedSL in future smart
manufacturing systems.
comment: 9 pages, 5 figures, submitted to the IEEE magazine
☆ Stable Robot Motions on Manifolds: Learning Lyapunov-Constrained Neural Manifold ODEs
Learning stable dynamical systems from data is crucial for safe and reliable
robot motion planning and control. However, extending stability guarantees to
trajectories defined on Riemannian manifolds poses significant challenges due
to the manifold's geometric constraints. To address this, we propose a general
framework for learning stable dynamical systems on Riemannian manifolds using
neural ordinary differential equations. Our method guarantees stability by
projecting the neural vector field evolving on the manifold so that it strictly
satisfies the Lyapunov stability criterion, ensuring stability at every system
state. By leveraging a flexible neural parameterisation for both the base
vector field and the Lyapunov function, our framework can accurately represent
complex trajectories while respecting manifold constraints by evolving
solutions directly on the manifold. We provide an efficient training strategy
for applying our framework and demonstrate its utility by solving Riemannian
LASA datasets on the unit quaternion ($S^3$) and symmetric positive-definite
matrix manifolds, as well as robotic motions evolving on $\mathbb{R}^3 \times
S^3$. We demonstrate the performance, scalability, and practical applicability
of our approach through extensive simulations and by learning robot motions in
a real-world experiment.
comment: 12 pages, 6 figures
☆ Oracle-Guided Masked Contrastive Reinforcement Learning for Visuomotor Policies
A prevailing approach for learning visuomotor policies is to employ
reinforcement learning to map high-dimensional visual observations directly to
action commands. However, the combination of high-dimensional visual inputs and
agile maneuver outputs leads to long-standing challenges, including low sample
efficiency and significant sim-to-real gaps. To address these issues, we
propose Oracle-Guided Masked Contrastive Reinforcement Learning (OMC-RL), a
novel framework designed to improve the sample efficiency and asymptotic
performance of visuomotor policy learning. OMC-RL explicitly decouples the
learning process into two stages: an upstream representation learning stage and
a downstream policy learning stage. In the upstream stage, a masked Transformer
module is trained with temporal modeling and contrastive learning to extract
temporally-aware and task-relevant representations from sequential visual
inputs. After training, the learned encoder is frozen and used to extract
visual representations from consecutive frames, while the Transformer module is
discarded. In the downstream stage, an oracle teacher policy with privileged
access to global state information supervises the agent during early training
to provide informative guidance and accelerate early policy learning. This
guidance is gradually reduced to allow independent exploration as training
progresses. Extensive experiments in simulated and real-world environments
demonstrate that OMC-RL achieves superior sample efficiency and asymptotic
policy performance, while also improving generalization across diverse and
perceptually complex scenarios.
☆ D2E: Scaling Vision-Action Pretraining on Desktop Data for Transfer to Embodied AI
Suwhan Choi, Jaeyoon Jung, Haebin Seong, Minchan Kim, Minyeong Kim, Yongjun Cho, Yoonshik Kim, Yubeen Park, Youngjae Yu, Yunsung Lee
Large language models leverage internet-scale text data, yet embodied AI
remains constrained by the prohibitive costs of physical trajectory collection.
Desktop environments -- particularly gaming -- offer a compelling alternative:
they provide rich sensorimotor interactions at scale while maintaining the
structured observation-action coupling essential for embodied learning. We
present D2E (Desktop to Embodied AI), a framework that demonstrates desktop
interactions can serve as an effective pretraining substrate for robotics
embodied AI tasks. Unlike prior work that remained domain-specific (e.g., VPT
for Minecraft) or kept data proprietary (e.g., SIMA), D2E establishes a
complete pipeline from scalable desktop data collection to verified transfer in
embodied domains. Our framework comprises three components: (1) the OWA Toolkit
that unifies diverse desktop interactions into a standardized format with 152x
compression, (2) the Generalist-IDM that achieves strong zero-shot
generalization across unseen games through timestamp-based event prediction,
enabling internet-scale pseudo-labeling, and (3) VAPT that transfers
desktop-pretrained representations to physical manipulation and navigation.
Using 1.3K+ hours of data (259 hours of human demonstrations, and 1K+ hours of
pseudo-labeled gameplay), we achieve a total of 96.6% success rate on LIBERO
manipulation and 83.3% on CANVAS navigation benchmarks. This validates that
sensorimotor primitives in digital interactions exhibit sufficient invariance
to transfer meaningfully to physical embodied tasks, establishing desktop
pretraining as a practical paradigm for robotics. We will make all our work
public, including the OWA Toolkit, the human-collected and pseudo-labeled
datasets, and the VAPT-trained models, at
https://worv-ai.github.io/d2e/
☆ Verifier-free Test-Time Sampling for Vision Language Action Models
Vision-Language-Action models (VLAs) have demonstrated remarkable performance
in robot control. However, they remain fundamentally limited in tasks that
require high precision due to their single-inference paradigm. While test-time
scaling approaches using external verifiers have shown promise, they require
additional training and fail to generalize to unseen conditions. We propose
Masking Distribution Guided Selection (MG-Select), a novel test-time scaling
framework for VLAs that leverages the model's internal properties without
requiring additional training or external modules. Our approach utilizes KL
divergence from a reference action token distribution as a confidence metric
for selecting the optimal action from multiple candidates. We introduce a
reference distribution generated by the same VLA but with randomly masked
states and language conditions as inputs, ensuring maximum uncertainty while
remaining aligned with the target task distribution. Additionally, we propose a
joint training strategy that enables the model to learn both conditional and
unconditional distributions by applying dropout to state and language
conditions, thereby further improving the quality of the reference
distribution. Our experiments demonstrate that MG-Select achieves significant
performance improvements, including a 28%/35% improvement in real-world
in-distribution/out-of-distribution tasks, along with a 168% relative gain on
RoboCasa pick-and-place tasks trained with 30 demonstrations.
comment: 14 pages; 3 figures
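The selection rule can be sketched as follows (our paraphrase of the abstract; the distributions are hypothetical model outputs, and larger divergence from the masked-input reference is read as higher confidence):

    import numpy as np

    def kl_divergence(p, q, eps=1e-9):
        return float(np.sum(p * np.log((p + eps) / (q + eps))))

    def select_action(candidate_dists, reference_dist):
        """Pick the candidate whose action-token distribution is farthest from
        the reference produced with masked state and language inputs."""
        scores = [kl_divergence(p, reference_dist) for p in candidate_dists]
        return int(np.argmax(scores))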
☆ DeLTa: Demonstration and Language-Guided Novel Transparent Object Manipulation
Taeyeop Lee, Gyuree Kang, Bowen Wen, Youngho Kim, Seunghyeok Back, In So Kweon, David Hyunchul Shim, Kuk-Jin Yoon
Despite the prevalence of transparent object interactions in human everyday
life, transparent robotic manipulation research remains limited to
short-horizon tasks and basic grasping capabilities. Although some methods have
partially addressed these issues, most of them have limitations in
generalizability to novel objects and are insufficient for precise long-horizon
robot manipulation. To address this limitation, we propose DeLTa (Demonstration
and Language-Guided Novel Transparent Object Manipulation), a novel framework
that integrates depth estimation, 6D pose estimation, and vision-language
planning for precise long-horizon manipulation of transparent objects guided by
natural task instructions. A key advantage of our method is its
single-demonstration approach, which generalizes 6D trajectories to novel
transparent objects without requiring category-level priors or additional
training. Additionally, we present a task planner that refines the
VLM-generated plan to account for the constraints of a single-arm, eye-in-hand
robot for long-horizon object manipulation tasks. Through comprehensive
evaluation, we demonstrate that our method significantly outperforms existing
transparent object manipulation approaches, particularly in long-horizon
scenarios requiring precise manipulation capabilities. Project page:
https://sites.google.com/view/DeLTa25/
comment: Project page: https://sites.google.com/view/DeLTa25/
☆ MetaVLA: Unified Meta Co-training For Efficient Embodied Adaption
Chen Li, Zhantao Yang, Han Zhang, Fangyi Chen, Chenchen Zhu, Anudeepsekhar Bolimera, Marios Savvides
Vision-Language-Action (VLA) models show promise in embodied reasoning, yet
remain far from true generalists: they often require task-specific fine-tuning,
and generalize poorly to unseen tasks. We propose MetaVLA, a unified,
backbone-agnostic post-training framework for efficient and scalable alignment.
MetaVLA introduces Context-Aware Meta Co-Training, which consolidates diverse
target tasks into a single fine-tuning stage while leveraging structurally
diverse auxiliary tasks to improve in-domain generalization. Unlike naive
multi-task SFT, MetaVLA integrates a lightweight meta-learning
mechanism, derived from Attentive Neural Processes, to enable rapid adaptation
from diverse contexts with minimal architectural change or inference overhead.
On the LIBERO benchmark, MetaVLA with six auxiliary tasks outperforms OpenVLA
by up to 8.0% on long-horizon tasks, reduces training steps from 240K to 75K,
and cuts GPU time by ~76%. These results show that scalable, low-resource
post-training is achievable, paving the way toward general-purpose embodied
agents. Code will be available.
☆ GO-Flock: Goal-Oriented Flocking in 3D Unknown Environments with Depth Maps
Yan Rui Tan, Wenqi Liu, Wai Lun Leong, John Guan Zhong Tan, Wayne Wen Huei Yong, Fan Shi, Rodney Swee Huat Teo
Artificial Potential Field (APF) methods are widely used for reactive
flocking control, but they often suffer from challenges such as deadlocks and
local minima, especially in the presence of obstacles. Existing solutions to
address these issues are typically passive, leading to slow and inefficient
collective navigation. As a result, many APF approaches have only been
validated in obstacle-free environments or simplified, pseudo 3D simulations.
This paper presents GO-Flock, a hybrid flocking framework that integrates
planning with reactive APF-based control. GO-Flock consists of an upstream
Perception Module, which processes depth maps to extract waypoints and virtual
agents for obstacle avoidance, and a downstream Collective Navigation Module,
which applies a novel APF strategy to achieve effective flocking behavior in
cluttered environments. We evaluate GO-Flock against passive APF-based
approaches to demonstrate their respective merits, such as their flocking
behavior and the ability to overcome local minima. Finally, we validate
GO-Flock in an obstacle-filled environment and through hardware-in-the-loop
experiments, where we successfully flocked a team of nine drones, six physical
and three virtual, in a forest environment.
☆ ARRC: Advanced Reasoning Robot Control - Knowledge-Driven Autonomous Manipulation Using Retrieval-Augmented Generation
We present ARRC (Advanced Reasoning Robot Control), a practical system that
connects natural-language instructions to safe local robotic control by
combining Retrieval-Augmented Generation (RAG) with RGB-D perception and
guarded execution on an affordable robot arm. The system indexes curated robot
knowledge (movement patterns, task templates, and safety heuristics) in a
vector database, retrieves task-relevant context for each instruction, and
conditions a large language model (LLM) to produce JSON-structured action
plans. Plans are executed on a UFactory xArm 850 fitted with a Dynamixel-driven
parallel gripper and an Intel RealSense D435 camera. Perception uses AprilTag
detections fused with depth to produce object-centric metric poses. Execution
is enforced via software safety gates: workspace bounds, speed and force caps,
timeouts, and bounded retries. We describe the architecture, knowledge design,
integration choices, and a reproducible evaluation protocol for tabletop scan,
approach, and pick-place tasks. Experimental results demonstrate the efficacy
of the proposed approach. Our design shows that RAG-based planning can
substantially improve plan validity and adaptability while keeping perception
and low-level control local to the robot.
☆ Correlation-Aware Dual-View Pose and Velocity Estimation for Dynamic Robotic Manipulation
Accurate pose and velocity estimation is essential for effective spatial task
planning in robotic manipulators. While centralized sensor fusion has
traditionally been used to improve pose estimation accuracy, this paper
presents a novel decentralized fusion approach to estimate both pose and
velocity. We use dual-view measurements from an eye-in-hand and an eye-to-hand
vision sensor configuration mounted on a manipulator to track a target object
whose motion is modeled as a random walk (stochastic acceleration model). The
robot runs two independent adaptive extended Kalman filters formulated on a
matrix Lie group, developed as part of this work. These filters predict poses
and velocities on the manifold $\mathbb{SE}(3) \times \mathbb{R}^3 \times
\mathbb{R}^3$ and update the state on the manifold $\mathbb{SE}(3)$. The final
fused state comprising the fused pose and velocities of the target is obtained
using a correlation-aware fusion rule on Lie groups. The proposed method is
evaluated on a UFactory xArm 850 equipped with Intel RealSense cameras,
tracking a moving target. Experimental results validate the effectiveness and
robustness of the proposed decentralized dual-view estimation framework,
showing consistent improvements over state-of-the-art methods.
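In a plain vector space, a correlation-aware fusion rule of this kind reduces to covariance intersection (a sketch of the general principle only; the paper's version operates on a matrix Lie group and is not reproduced here):

    import numpy as np

    def covariance_intersection(x1, P1, x2, P2, omega=0.5):
        """Fuse (x1, P1) and (x2, P2) without knowing their cross-correlation."""
        P = np.linalg.inv(omega * np.linalg.inv(P1) + (1 - omega) * np.linalg.inv(P2))
        x = P @ (omega * np.linalg.solve(P1, x1) + (1 - omega) * np.linalg.solve(P2, x2))
        return x, P  # in practice omega is chosen to minimize, e.g., trace(P)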
♻ ☆ pRRTC: GPU-Parallel RRT-Connect for Fast, Consistent, and Low-Cost Motion Planning
Sampling-based motion planning algorithms, like the Rapidly-Exploring Random
Tree (RRT) and its widely used variant, RRT-Connect, provide efficient
solutions for high-dimensional planning problems faced by real-world robots.
However, these methods remain computationally intensive, particularly in
complex environments that require many collision checks. To improve
performance, recent efforts have explored parallelizing specific components of
RRT such as collision checking, or running multiple planners independently.
However, little has been done to develop an integrated parallelism approach,
co-designed for large-scale parallelism. In this work, we present pRRTC, an
RRT-Connect-based planner co-designed for GPU acceleration across the entire
algorithm through parallel expansion and SIMT-optimized collision checking. We
evaluate the effectiveness of pRRTC on the MotionBenchMaker dataset using
robots with 7, 8, and 14 degrees of freedom (DoF). Compared to the
state-of-the-art, pRRTC achieves as much as a 10x speedup on constrained
reaching tasks with a 5.4x reduction in standard deviation. pRRTC also achieves
a 1.4x reduction in average initial path cost. Finally, we deploy pRRTC on a
14-DoF dual Franka Panda arm setup and demonstrate real-time, collision-free
motion planning with dynamic obstacles. We open-source our planner to support
the wider community.
comment: 7 pages, 7 figures, 1 table. Submitted to IEEE International
Conference on Robotics and Automation 2026
♻ ☆ BC-ADMM: An Efficient Non-convex Constrained Optimizer with Robotic Applications
Non-convex constrained optimizations are ubiquitous in robotic applications
such as multi-agent navigation, UAV trajectory optimization, and soft robot
simulation. For this problem class, conventional optimizers suffer from small
step sizes and slow convergence. We propose BC-ADMM, a variant of the
Alternating Direction Method of Multipliers (ADMM), that can solve a class of non-convex
constrained optimizations with biconvex constraint relaxation. Our algorithm
allows larger step sizes by breaking the problem into small-scale sub-problems
that can be easily solved in parallel. We show that our method has both
theoretical convergence speed guarantees and practical convergence guarantees
in the asymptotic sense. Through numerical experiments on four robotic
applications, we show that BC-ADMM has faster convergence than conventional
gradient descent and Newton's method in terms of wall clock time.
♻ ☆ Toward Dynamic Control of Tendon-driven Continuum Robots using Clarke Transform IROS 2025
In this paper, we propose a dynamic model and control framework for
tendon-driven continuum robots (TDCRs) with multiple segments and an arbitrary
number of tendons per segment. Our approach leverages the Clarke transform, the
Euler-Lagrange formalism, and the piecewise constant curvature assumption to
formulate a dynamic model on a two-dimensional manifold embedded in the joint
space that inherently satisfies tendon constraints. We present linear and
constraint-informed controllers that operate directly on this manifold, along
with practical methods for preventing negative tendon forces without
compromising control fidelity. This opens up new design possibilities for
overactuated TDCRs with improved force distribution and stiffness without
increasing controller complexity. We validate these approaches in simulation
and on a physical prototype with one segment and five tendons, demonstrating
accurate dynamic behavior and robust trajectory tracking under real-time
conditions.
comment: Accepted for publication at IEEE/RSJ International Conference on
Intelligent Robots and Systems (IROS 2025), 8 pages, and 8 figures
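For orientation, the generalized Clarke transform for n evenly spaced tendons maps the n coupled tendon values to two independent coordinates on the constraint manifold (a sketch under a common convention; the scaling and tendon indexing may differ from the paper's):

    import numpy as np

    def clarke_matrix(n):
        """2 x n generalized Clarke matrix for n evenly distributed tendons."""
        angles = 2.0 * np.pi * np.arange(n) / n
        return (2.0 / n) * np.vstack([np.cos(angles), np.sin(angles)])

    # Five tendons (as on the physical prototype) projected to 2D coordinates;
    # the displacement values are arbitrary illustrative numbers.
    xy = clarke_matrix(5) @ np.array([0.01, -0.004, -0.002, -0.002, -0.002])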
♻ ☆ CottonSim: A vision-guided autonomous robotic system for cotton harvesting in Gazebo simulation
Thevathayarajh Thayananthan, Xin Zhang, Yanbo Huang, Jingdao Chen, Nuwan K. Wijewardane, Vitor S. Martins, Gary D. Chesser, Christopher T. Goodin
Cotton is a major cash crop in the United States, with the country being a
leading global producer and exporter. Nearly all U.S. cotton is grown in the
Cotton Belt, spanning 17 states in the southern region. Harvesting remains a
critical yet challenging stage, impacted by the use of costly, environmentally
harmful defoliants and heavy, expensive cotton pickers. These factors
contribute to yield loss, reduced fiber quality, and soil compaction, which
collectively threaten long-term sustainability. To address these issues, this
study proposes a lightweight, small-scale, vision-guided autonomous robotic
cotton picker as an alternative. An autonomous system, built on Clearpath's
Husky platform and integrated with the CottonEye perception system, was
developed and tested in the Gazebo simulation environment. A virtual cotton
field was designed to facilitate autonomous navigation testing. The navigation
system used Global Positioning System (GPS) and map-based guidance, assisted by
an RGB-depth camera and a YOLOv8n-seg instance segmentation model. The model
achieved a mean Average Precision (mAP) of 85.2%, a recall of 88.9%, and a
precision of 93.0%. The GPS-based approach reached a 100% completion rate (CR)
within a $(5e-6)^{\circ}$ threshold, while the map-based method achieved a
96.7% CR within a 0.25 m threshold. The developed Robot Operating System (ROS)
packages enable robust simulation of autonomous cotton picking, offering a
scalable baseline for future agricultural robotics. CottonSim code and datasets
are publicly available on GitHub: https://github.com/imtheva/CottonSim
comment: 16 pages, 15 figures, 4 tables
♻ ☆ Capturing a Moving Target by Two Robots in the F2F Model
We study a search problem on capturing a moving target on an infinite real
line. Two autonomous mobile robots (which can move with a maximum speed of 1)
are initially placed at the origin, while an oblivious moving target is
initially placed at a distance $d$ away from the origin. The robots can move
along the line in any direction, but the target is oblivious, cannot change
direction, and moves either away from or toward the origin at a constant speed
$v$. Our aim is to design efficient algorithms for the two robots to capture
the target. The target is captured only when both robots are co-located with
it. The robots communicate with each other only face-to-face (F2F), meaning
they can exchange information only when co-located, while the target remains
oblivious and has no communication capabilities.
We design algorithms under various knowledge scenarios, which take into
account the prior knowledge the robots have about the starting distance $d$,
the direction of movement (either toward or away from the origin), and the
speed $v$ of the target. As a measure of the efficiency of the algorithms, we
use the competitive ratio, which is the ratio of the capture time of an
algorithm with limited knowledge to the capture time in the full-knowledge
model. In our analysis, we are mindful of the cost of changing direction of
movement, and show how to accomplish the capture of the target with at most
three direction changes (turns).
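For context on the full-knowledge baseline in that ratio: if both robots know $d$, $v$, and the target's direction, they can travel together straight at the target at speed 1, giving capture times of $d/(1-v)$ for a fleeing target (assuming $v < 1$) and $d/(1+v)$ for an approaching one. This is our illustrative reading of the benchmark, not a result quoted from the paper.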
♻ ☆ Emergent interactions lead to collective frustration in robotic matter
Current artificial intelligence systems show near-human-level capabilities
when deployed in isolation. Systems of a few collaborating intelligent agents
are being engineered to perform tasks collectively. This raises the question of
whether robotic matter, where many learning and intelligent agents interact,
shows emergence of collective behaviour. And if so, which kind of phenomena
would such systems exhibit? Here, we study a paradigmatic model for robotic
matter: a stochastic many-particle system in which each particle is endowed
with a deep neural network that predicts its transitions based on the
particles' environments. For a one-dimensional model, we show that robotic
matter exhibits complex emergent phenomena, including transitions between
long-lived learning regimes, the emergence of particle species, and
frustration. We also find a density-dependent phase transition with signatures
of criticality. Using active matter theory, we show that this phase transition
is a consequence of self-organisation mediated by emergent inter-particle
interactions. Our simple model captures key features of more complex forms of
robotic systems.
♻ ☆ mindmap: Spatial Memory in Deep Feature Maps for 3D Action Policies
Remo Steiner, Alexander Millane, David Tingdahl, Clemens Volk, Vikram Ramasamy, Xinjie Yao, Peter Du, Soha Pouya, Shiwei Sheng
End-to-end learning of robot control policies, structured as neural networks,
has emerged as a promising approach to robotic manipulation. To complete many
common tasks, relevant objects are required to pass in and out of a robot's
field of view. In these settings, spatial memory - the ability to remember the
spatial composition of the scene - is an important competency. However,
building such mechanisms into robot learning systems remains an open research
problem. We introduce mindmap (Spatial Memory in Deep Feature Maps for 3D
Action Policies), a 3D diffusion policy that generates robot trajectories based
on a semantic 3D reconstruction of the environment. We show in simulation
experiments that our approach is effective at solving tasks where
state-of-the-art approaches without memory mechanisms struggle. We release our
reconstruction system, training code, and evaluation tasks to spur research in
this direction.
comment: Accepted to CoRL 2025 Workshop RemembeRL
♻ ★ FlowVLA: Visual Chain of Thought-based Motion Reasoning for Vision-Language-Action Models
Zhide Zhong, Haodong Yan, Junfeng Li, Xiangchen Liu, Xin Gong, Tianran Zhang, Wenxuan Song, Jiayi Chen, Xinhu Zheng, Hesheng Wang, Haoang Li
Many Vision-Language-Action (VLA) models are built upon an internal world
model trained via next-frame prediction "$v_t \rightarrow v_{t+1}$". However,
this paradigm attempts to predict the future frame's appearance directly,
without explicitly reasoning about the underlying dynamics. This lack of an
explicit motion reasoning step often leads to physically implausible visual
forecasts and inefficient policy learning. To address this limitation, we
introduce the Visual Chain of Thought (Visual CoT), a paradigm that compels
the model to first reason about motion dynamics before generating the future
frame. We instantiate this paradigm by proposing FlowVLA, an autoregressive
Transformer that explicitly materializes this reasoning process as "$v_t
\rightarrow f_t \rightarrow v_{t+1}$", where $f_t$ is an intermediate optical
flow prediction that inherently encodes motion. By forcing the model to first
follow the motion plan encoded by $f_t$, this process inherently aligns the
pre-training objective of dynamics prediction with the downstream task of
action generation. We conduct experiments on challenging robotics manipulation
benchmarks, as well as real-robot evaluations. Our FlowVLA not only generates
more coherent and physically plausible visual predictions, but also achieves
state-of-the-art policy performance with substantially improved sample
efficiency, pointing toward a more principled foundation for world modeling
in VLAs. Project page: https://irpn-lab.github.io/FlowVLA/
♻ ☆ Equivariant Filter for Relative Attitude and Target's Angular Velocity Estimation
Accurate estimation of the relative attitude and angular velocity between two
rigid bodies is fundamental in aerospace applications such as spacecraft
rendezvous and docking. In these scenarios, a chaser vehicle must determine the
orientation and angular velocity of a target object using onboard sensors. This
work addresses the challenge of designing an Equivariant Filter (EqF) that can
reliably estimate both the relative attitude and the target angular velocity
using noisy observations of two known, non-collinear vectors fixed in the
target frame. To derive the EqF, a symmetry for the system is proposed and an
equivariant lift onto the symmetry group is calculated. Observability and
convergence properties are analyzed. Simulations demonstrate the filter's
performance, with Monte Carlo runs yielding statistically significant results.
The impact of low-rate measurements is also examined and a strategy to mitigate
this effect is proposed. Experimental results, using fiducial markers and both
conventional and event cameras for measurement acquisition, further validate
the approach, confirming its effectiveness in a realistic setting.
comment: This work has been submitted to the IEEE for possible publication
♻ ☆ Identifying Uncertainty in Self-Adaptive Robotics with Large Language Models
Future self-adaptive robots are expected to operate in highly dynamic
environments while effectively managing uncertainties. However, identifying the
sources and impacts of uncertainties in such robotic systems and defining
appropriate mitigation strategies is challenging due to the inherent complexity
of self-adaptive robots and the lack of comprehensive knowledge about the
various factors influencing uncertainty. Hence, practitioners often rely on
intuition and past experiences from similar systems to address uncertainties.
In this article, we evaluate the potential of large language models (LLMs) in
enabling a systematic and automated approach to identify uncertainties in
self-adaptive robotics throughout the software engineering lifecycle. For this
evaluation, we analyzed 10 advanced LLMs with varying capabilities across four
industrial-sized robotics case studies, gathering the practitioners'
perspectives on the LLM-generated responses related to uncertainties. Results
showed that practitioners agreed with 63-88% of the LLM responses and expressed
strong interest in the practicality of LLMs for this purpose.
♻ ☆ Image-Based Visual Servoing for Enhanced Cooperation of Dual-Arm Manipulation
Manipulating a target object without any fixtures requires the cooperation of
a pair of robot manipulators. Conventional control methods coordinate
the end-effector pose of each manipulator with that of the other using their
kinematics and joint coordinate measurements. Yet, the manipulators' inaccurate
kinematics and joint coordinate measurements can cause significant pose
synchronization errors in practice. This paper thus proposes an image-based
visual servoing approach for enhancing the cooperation of a dual-arm
manipulation system. On top of the classical control, the visual servoing
controller lets each manipulator use its carried camera to measure the image
features of the other's marker and adapt its end-effector pose with the
counterpart on the move. Because visual measurements are robust to kinematic
errors, the proposed control can reduce the end-effector pose synchronization
errors and the fluctuations of the interaction forces of the pair of
manipulators on the move. Theoretical analyses have rigorously proven the
stability of the closed-loop system. Comparative experiments on real robots
have substantiated the effectiveness of the proposed control.
comment: 8 pages, 7 figures. Project website:
https://zizhe.io/ral-ibvs-enhanced/. This work has been accepted to the IEEE
Robotics and Automation Letters in Feb 2025
♻ ☆ Interpreting Behaviors and Geometric Constraints as Knowledge Graphs for Robot Manipulation Control
In this paper, we investigate the feasibility of using knowledge graphs to
interpret actions and behaviors for robot manipulation control. Equipped with
an uncalibrated visual servoing controller, we propose to use robot knowledge
graphs to unify behavior trees and geometric constraints, conceptualizing robot
manipulation control as semantic events. The robot knowledge graphs not only
preserve the advantages of behavior trees in scripting actions and behaviors,
but also offer additional benefits of mapping natural interactions between
concepts and events, which enable knowledgeable explanations of the
manipulation contexts. Through real-world evaluations, we demonstrate the
flexibility of the robot knowledge graphs to support explainable robot
manipulation control.
♻ ☆ RoboMemory: A Brain-inspired Multi-memory Agentic Framework for Interactive Environmental Learning in Physical Embodied Systems
Mingcong Lei, Honghao Cai, Zezhou Cui, Liangchen Tan, Junkun Hong, Gehan Hu, Shuangyu Zhu, Yimou Wu, Shaohan Jiang, Ge Wang, Yuyuan Yang, Junyuan Tan, Zhenglin Wan, Zhen Li, Shuguang Cui, Yiming Zhao, Yatong Han
Embodied agents face persistent challenges in real-world environments,
including partial observability, limited spatial reasoning, and high-latency
multi-memory integration. We present RoboMemory, a brain-inspired framework
that unifies Spatial, Temporal, Episodic, and Semantic memory under a
parallelized architecture for efficient long-horizon planning and interactive
environmental learning. A dynamic spatial knowledge graph (KG) ensures scalable
and consistent memory updates, while a closed-loop planner with a critic module
supports adaptive decision-making in dynamic settings. Experiments on
EmbodiedBench show that RoboMemory, built on Qwen2.5-VL-72B-Ins, improves
average success rates by 25% over its baseline and exceeds the closed-source
state-of-the-art (SOTA) Gemini-1.5-Pro by 3%. Real-world trials further confirm
its capacity for cumulative learning, with performance improving across
repeated tasks. These results highlight RoboMemory as a scalable foundation for
memory-augmented embodied intelligence, bridging the gap between cognitive
neuroscience and robotic autonomy.
♻ ☆ Self-Supervised Representation Learning with Joint Embedding Predictive Architecture for Automotive LiDAR Object Detection
Recently, self-supervised representation learning relying on vast amounts of
unlabeled data has been explored as a pre-training method for autonomous
driving. However, directly applying popular contrastive or generative methods
to this problem is insufficient and may even lead to negative transfer. In this
paper, we present AD-L-JEPA, a novel self-supervised pre-training framework
with a joint embedding predictive architecture (JEPA) for automotive LiDAR
object detection. Unlike existing methods, AD-L-JEPA is neither generative nor
contrastive. Instead of explicitly generating masked regions, our method
predicts Bird's-Eye-View embeddings to capture the diverse nature of driving
scenes. Furthermore, our approach eliminates the need to manually form
contrastive pairs by employing explicit variance regularization to avoid
representation collapse. Experimental results demonstrate consistent
improvements on the LiDAR 3D object detection downstream task across the
KITTI3D, Waymo, and ONCE datasets, while reducing GPU hours by 1.9x-2.7x and
GPU memory by 2.8x-4x compared with the state-of-the-art method Occupancy-MAE.
Notably, on the largest ONCE dataset, pre-training on 100K frames yields a 1.61
mAP gain, better than all other methods pre-trained on either 100K or 500K
frames, and pre-training on 500K frames yields a 2.98 mAP gain, better than all
other methods pre-trained on either 500K or 1M frames. AD-L-JEPA constitutes
the first JEPA-based pre-training method for autonomous driving. It offers
better quality, faster, and more GPU-memory-efficient self-supervised
representation learning. The source code of AD-L-JEPA is ready to be released.
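Explicit variance regularization of the kind mentioned above is typically a hinge penalty on the per-dimension spread of embeddings, in the spirit of the variance term popularized by VICReg (a sketch; AD-L-JEPA's exact loss may differ):

    import torch

    def variance_regularizer(z, target_std=1.0, eps=1e-4):
        """z: (batch, dim) embeddings; penalize dimensions whose std collapses."""
        std = torch.sqrt(z.var(dim=0) + eps)
        return torch.relu(target_std - std).mean()  # zero once all dims are spread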