Robotics 41
☆ GeoVLA: Empowering 3D Representations in Vision-Language-Action Models
Vision-Language-Action (VLA) models have emerged as a promising approach for
enabling robots to follow language instructions and predict corresponding
actions. However, current VLA models mainly rely on 2D visual inputs, neglecting
the rich geometric information in the 3D physical world, which limits their
spatial awareness and adaptability. In this paper, we present GeoVLA, a novel
VLA framework that effectively integrates 3D information to advance robotic
manipulation. It uses a vision-language model (VLM) to process images and
language instructions, extracting fused vision-language embeddings. In parallel,
it converts depth maps into point clouds and employs a customized point
encoder, called Point Embedding Network, to generate 3D geometric embeddings
independently. These produced embeddings are then concatenated and processed by
our proposed spatial-aware action expert, called 3D-enhanced Action Expert,
which combines information from different sensor modalities to produce precise
action sequences. Through extensive experiments in both simulation and
real-world environments, GeoVLA demonstrates superior performance and
robustness. It achieves state-of-the-art results in the LIBERO and ManiSkill2
simulation benchmarks and shows remarkable robustness in real-world tasks
requiring height adaptability, scale awareness and viewpoint invariance.
comment: The project is visible at https://linsun449.github.io/GeoVLA/
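As a reading aid for the fusion pattern the abstract describes (independent 3D geometric embeddings concatenated with vision-language embeddings and decoded into an action chunk), here is a minimal, hypothetical sketch; the module names, dimensions, and pooling choices are assumptions, not GeoVLA's actual architecture.

```python
# Hypothetical sketch of the GeoVLA-style fusion described above: vision-language
# embeddings and point-cloud embeddings are produced independently, concatenated,
# and decoded into an action sequence. All names and sizes are assumptions.
import torch
import torch.nn as nn

class PointEmbeddingNet(nn.Module):
    """Toy stand-in for the Point Embedding Network (per-point MLP + max pooling)."""
    def __init__(self, dim=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, 128), nn.ReLU(), nn.Linear(128, dim))

    def forward(self, points):          # points: (B, N, 3)
        feats = self.mlp(points)        # (B, N, dim)
        return feats.max(dim=1).values  # (B, dim) permutation-invariant pooling

class ActionExpert(nn.Module):
    """Toy stand-in for the 3D-enhanced Action Expert: fused features -> action chunk."""
    def __init__(self, vl_dim=512, pt_dim=256, horizon=8, act_dim=7):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(vl_dim + pt_dim, 512), nn.ReLU(),
            nn.Linear(512, horizon * act_dim))
        self.horizon, self.act_dim = horizon, act_dim

    def forward(self, vl_emb, pt_emb):  # (B, vl_dim), (B, pt_dim)
        fused = torch.cat([vl_emb, pt_emb], dim=-1)
        return self.head(fused).view(-1, self.horizon, self.act_dim)

vl_emb = torch.randn(2, 512)       # would come from the VLM
points = torch.randn(2, 1024, 3)   # point cloud lifted from the depth map
actions = ActionExpert()(vl_emb, PointEmbeddingNet()(points))
print(actions.shape)               # torch.Size([2, 8, 7])
```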
☆ Spatial Traces: Enhancing VLA Models with Spatial-Temporal Understanding
Vision-Language-Action models have demonstrated remarkable capabilities in
predicting agent movements within virtual environments and real-world scenarios
based on visual observations and textual instructions. Although recent research
has focused on enhancing spatial and temporal understanding independently, this
paper presents a novel approach that integrates both aspects through visual
prompting. We introduce a method that projects visual traces of key points from
observations onto depth maps, enabling models to capture both spatial and
temporal information simultaneously. The experiments in SimplerEnv show that
the mean number of tasks successfully solved increased by 4% compared to
SpatialVLA and by 19% compared to TraceVLA. Furthermore, we show that this
enhancement can be achieved with minimal training data, making it particularly
valuable for real-world applications where data collection is challenging. The
project page is available at https://ampiromax.github.io/ST-VLA.
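To make the core idea of pairing keypoint traces with depth concrete, here is a small illustrative sketch of back-projecting a 2D trace through a depth map; the camera intrinsics and function name are placeholders, and this is not the paper's actual prompting pipeline.

```python
# Minimal sketch: each 2D keypoint of a trace is paired with the depth value at that
# pixel, giving a 3D trace that could be drawn onto the depth map as a visual prompt.
import numpy as np

def trace_with_depth(trace_uv, depth, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """trace_uv: (T, 2) pixel coordinates of a tracked keypoint over time.
    depth: (H, W) depth map in meters. Returns (T, 3) camera-frame points."""
    u, v = trace_uv[:, 0].astype(int), trace_uv[:, 1].astype(int)
    z = depth[v, u]
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    return np.stack([x, y, z], axis=-1)

depth = np.full((480, 640), 1.5, dtype=np.float32)      # dummy depth map
trace = np.array([[300, 200], [310, 205], [320, 212]])  # dummy 2D trace
print(trace_with_depth(trace, depth))
```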
☆ Large Scale Robotic Material Handling: Learning, Planning, and Control
Filippo A. Spinelli, Yifan Zhai, Fang Nan, Pascal Egli, Julian Nubert, Thilo Bleumer, Lukas Miller, Ferdinand Hofmann, Marco Hutter
Bulk material handling involves the efficient and precise moving of large
quantities of materials, a core operation in many industries, including cargo
ship unloading, waste sorting, construction, and demolition. These repetitive,
labor-intensive, and safety-critical operations are typically performed using
large hydraulic material handlers equipped with underactuated grippers. In this
work, we present a comprehensive framework for the autonomous execution of
large-scale material handling tasks. The system integrates specialized modules
for environment perception, pile attack point selection, path planning, and
motion control. The main contributions of this work are two reinforcement
learning-based modules: an attack point planner that selects optimal grasping
locations on the material pile to maximize removal efficiency and minimize the
number of scoops, and a robust trajectory following controller that addresses
the precision and safety challenges associated with underactuated grippers in
movement, while utilizing their free-swinging nature to release material
through dynamic throwing. We validate our framework through real-world
experiments on a 40 t material handler in a representative worksite, focusing
on two key tasks: high-throughput bulk pile management and high-precision truck
loading. Comparative evaluations against human operators demonstrate the
system's effectiveness in terms of precision, repeatability, and operational
safety. To the best of our knowledge, this is the first complete automation of
material handling tasks on a full scale.
comment: Preliminary version, currently undergoing review process
☆ Generation of Real-time Robotic Emotional Expressions Learning from Human Demonstration in Mixed Reality
Expressive behaviors in robots are critical for effectively conveying their
emotional states during interactions with humans. In this work, we present a
framework that autonomously generates realistic and diverse robotic emotional
expressions based on expert human demonstrations captured in Mixed Reality
(MR). Our system enables experts to teleoperate a virtual robot from a
first-person perspective, capturing their facial expressions, head movements,
and upper-body gestures, and mapping these behaviors onto corresponding robotic
components including eyes, ears, neck, and arms. Leveraging a
flow-matching-based generative process, our model learns to produce coherent
and varied behaviors in real-time in response to moving objects, conditioned
explicitly on given emotional states. A preliminary test validated the
effectiveness of our approach for generating autonomous expressions.
comment: 4
☆ Rational Inverse Reasoning
Humans can observe a single, imperfect demonstration and immediately
generalize to very different problem settings. Robots, in contrast, often
require hundreds of examples and still struggle to generalize beyond the
training conditions. We argue that this limitation arises from the inability to
recover the latent explanations that underpin intelligent behavior, and that
these explanations can take the form of structured programs consisting of
high-level goals, sub-task decomposition, and execution constraints. In this
work, we introduce Rational Inverse Reasoning (RIR), a framework for inferring
these latent programs through a hierarchical generative model of behavior. RIR
frames few-shot imitation as Bayesian program induction: a vision-language
model iteratively proposes structured symbolic task hypotheses, while a
planner-in-the-loop inference scheme scores each by the likelihood of the
observed demonstration under that hypothesis. This loop yields a posterior over
concise, executable programs. We evaluate RIR on a suite of continuous
manipulation tasks designed to test one-shot and few-shot generalization across
variations in object pose, count, geometry, and layout. With as little as one
demonstration, RIR infers the intended task structure and generalizes to novel
settings, outperforming state-of-the-art vision-language model baselines.
☆ Unsupervised Skill Discovery as Exploration for Learning Agile Locomotion
Exploration is crucial for enabling legged robots to learn agile locomotion
behaviors that can overcome diverse obstacles. However, such exploration is
inherently challenging, and we often rely on extensive reward engineering,
expert demonstrations, or curriculum learning - all of which limit
generalizability. In this work, we propose Skill Discovery as Exploration
(SDAX), a novel learning framework that significantly reduces human engineering
effort. SDAX leverages unsupervised skill discovery to autonomously acquire a
diverse repertoire of skills for overcoming obstacles. To dynamically regulate
the level of exploration during training, SDAX employs a bi-level optimization
process that autonomously adjusts the degree of exploration. We demonstrate
that SDAX enables quadrupedal robots to acquire highly agile behaviors
including crawling, climbing, leaping, and executing complex maneuvers such as
jumping off vertical walls. Finally, we deploy the learned policy on real
hardware, validating its successful transfer to the real world.
comment: Conference on Robot Learning 2025
☆ Shape Completion and Real-Time Visualization in Robotic Ultrasound Spine Acquisitions
Ultrasound (US) imaging is increasingly used in spinal procedures due to its
real-time, radiation-free capabilities; however, its effectiveness is hindered
by shadowing artifacts that obscure deeper tissue structures. Traditional
approaches, such as CT-to-US registration, incorporate anatomical information
from preoperative CT scans to guide interventions, but they are limited by
complex registration requirements, differences in spine curvature, and the need
for recent CT imaging. Recent shape completion methods can offer an alternative
by reconstructing spinal structures in US data, while being pretrained on a large
set of publicly available CT scans. However, these approaches are typically
offline and have limited reproducibility. In this work, we introduce a novel
integrated system that combines robotic ultrasound with real-time shape
completion to enhance spinal visualization. Our robotic platform autonomously
acquires US sweeps of the lumbar spine, extracts vertebral surfaces from
ultrasound, and reconstructs the complete anatomy using a deep learning-based
shape completion network. This framework provides interactive, real-time
visualization with the capability to autonomously repeat scans and can enable
navigation to target locations. This can contribute to better consistency,
reproducibility, and understanding of the underlying anatomy. We validate our
approach through quantitative experiments assessing shape completion accuracy
and evaluations of multiple spine acquisition protocols on a phantom setup.
Additionally, we present qualitative results of the visualization on a
volunteer scan.
☆ Towards Affordance-Aware Robotic Dexterous Grasping with Human-like Priors
Haoyu Zhao, Linghao Zhuang, Xingyue Zhao, Cheng Zeng, Haoran Xu, Yuming Jiang, Jun Cen, Kexiang Wang, Jiayan Guo, Siteng Huang, Xin Li, Deli Zhao, Hua Zou
A dexterous hand capable of generalizable object grasping is fundamental for
the development of general-purpose embodied AI. However, previous methods focus
narrowly on low-level grasp stability metrics, neglecting affordance-aware
positioning and human-like poses which are crucial for downstream manipulation.
To address these limitations, we propose AffordDex, a novel framework with
two-stage training that learns a universal grasping policy with an inherent
understanding of both motion priors and object affordances. In the first stage,
a trajectory imitator is pre-trained on a large corpus of human hand motions to
instill a strong prior for natural movement. In the second stage, a residual
module is trained to adapt these general human-like motions to specific object
instances. This refinement is critically guided by two components: our Negative
Affordance-aware Segmentation (NAA) module, which identifies functionally
inappropriate contact regions, and a privileged teacher-student distillation
process that ensures the final vision-based policy is highly successful.
Extensive experiments demonstrate that AffordDex not only achieves universal
dexterous grasping but also remains remarkably human-like in posture and
functionally appropriate in contact location. As a result, AffordDex
significantly outperforms state-of-the-art baselines across seen objects,
unseen instances, and even entirely novel categories.
comment: 13 pages, 8 figures
☆ DiffPhysCam: Differentiable Physics-Based Camera Simulation for Inverse Rendering and Embodied AI
We introduce DiffPhysCam, a differentiable camera simulator designed to
support robotics and embodied AI applications by enabling gradient-based
optimization in visual perception pipelines. Generating synthetic images that
closely mimic those from real cameras is essential for training visual models
and enabling end-to-end visuomotor learning. Moreover, differentiable rendering
allows inverse reconstruction of real-world scenes as digital twins,
facilitating simulation-based robotics training. However, existing virtual
cameras offer limited control over intrinsic settings, poorly capture optical
artifacts, and lack tunable calibration parameters -- hindering sim-to-real
transfer. DiffPhysCam addresses these limitations through a multi-stage
pipeline that provides fine-grained control over camera settings, models key
optical effects such as defocus blur, and supports calibration with real-world
data. It enables both forward rendering for image synthesis and inverse
rendering for 3D scene reconstruction, including mesh and material texture
optimization. We show that DiffPhysCam enhances robotic perception performance
in synthetic image tasks. As an illustrative example, we create a digital twin
of a real-world scene using inverse rendering, simulate it in a multi-physics
environment, and demonstrate navigation of an autonomous ground vehicle using
images generated by DiffPhysCam.
comment: 19 pages, 17 figures, and 4 tables
☆ Robot can reduce superior's dominance in group discussions with human social hierarchy
This study investigated whether robotic agents that deal with social
hierarchical relationships can reduce the dominance of superiors and equalize
participation among participants in discussions with hierarchical structures.
Thirty doctors and students in a hierarchical relationship were gathered as
participants, and an intervention experiment was conducted using a robot that
can encourage participants to speak depending on social hierarchy. These were
compared with strategies that intervened equally for all participants without
considering hierarchy and with a no-action condition. The robots performed follow
actions (backchanneling to speech) and encourage actions (prompting speech from
members with less speaking time) on the basis of the hierarchical relationships
among group members to equalize participation. The experimental
results revealed that the robot's actions could potentially influence the
speaking time among members, but it could not be conclusively stated that there
were significant differences between the robot's action conditions. However,
the results suggested that it might be possible to influence speaking time
without decreasing the satisfaction of superiors. This indicates that in
discussion scenarios where experienced superiors are likely to dominate,
controlling the robot's backchanneling behavior could potentially suppress
dominance and equalize participation among group members.
comment: 8 pages, 7 figures. International Conference on Human-Agent
Interaction (HAI '24), November 24-27, 2024, Swansea, United Kingdom
☆ Visual Prompting for Robotic Manipulation with Annotation-Guided Pick-and-Place Using ACT
Robotic pick-and-place tasks in convenience stores pose challenges due to
dense object arrangements, occlusions, and variations in object properties such
as color, shape, size, and texture. These factors complicate trajectory
planning and grasping. This paper introduces a perception-action pipeline
leveraging annotation-guided visual prompting, where bounding box annotations
identify both pickable objects and placement locations, providing structured
spatial guidance. Instead of traditional step-by-step planning, we employ
Action Chunking with Transformers (ACT) as an imitation learning algorithm,
enabling the robotic arm to predict chunked action sequences from human
demonstrations. This facilitates smooth, adaptive, and data-driven
pick-and-place operations. We evaluate our system based on success rate and
visual analysis of grasping behavior, demonstrating improved grasp accuracy and
adaptability in retail environments.
☆ Boosting Action-Information via a Variational Bottleneck on Unlabelled Robot Videos
Learning from demonstrations (LfD) typically relies on large amounts of
action-labeled expert trajectories, which fundamentally constrains the scale of
available training data. A promising alternative is to learn directly from
unlabeled video demonstrations. However, we find that existing methods tend to
encode latent actions that share little mutual information with the true robot
actions, leading to suboptimal control performance. To address this limitation,
we introduce a novel framework that explicitly maximizes the mutual information
between latent actions and true actions, even in the absence of action labels.
Our method leverages the variational information bottleneck to extract
action-relevant representations while discarding task-irrelevant information.
We provide a theoretical analysis showing that our objective indeed maximizes
the mutual information between latent and true actions. Finally, we validate
our approach through extensive experiments, first in simulated robotic
environments and then on real-world robotic platforms; the experimental results
demonstrate that our method significantly enhances mutual information and
consistently improves policy performance.
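The abstract does not state the paper's exact loss. For reference only, the textbook variational information bottleneck objective that this line of work builds on reads (with z the latent action, x the video observation, a the action to be predicted, and β the compression weight; all symbols here are generic, not the paper's notation):

\[
\mathcal{L}_{\mathrm{VIB}} \;=\; \mathbb{E}_{q(z\mid x)}\!\left[-\log p(a\mid z)\right] \;+\; \beta\, D_{\mathrm{KL}}\!\left(q(z\mid x)\,\|\,p(z)\right),
\]

where minimizing the first term maximizes a lower bound on the mutual information between z and a, while the KL term bounds the information retained about x, discarding task-irrelevant content. Since true actions are unavailable during this paper's training, its surrogate objective necessarily differs from this template.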
☆ CRADLE: Conversational RTL Design Space Exploration with LLM-based Multi-Agent Systems
This paper presents CRADLE, a conversational framework for design space
exploration of RTL designs using LLM-based multi-agent systems. Unlike existing
rigid approaches, CRADLE enables user-guided flows with internal
self-verification, correction, and optimization. We demonstrate the framework
with a generator-critic agent system targeting FPGA resource minimization using
state-of-the-art LLMs. Experimental results on the RTLLM benchmark show that
CRADLE achieves significant reductions in resource usage with averages of 48%
and 40% in LUTs and FFs across all benchmark designs.
comment: Accepted for presentation at the 22nd International SoC Conference
(ISOCC 2025). Proceedings to be included in IEEE Xplore
☆ Towards Safe Imitation Learning via Potential Field-Guided Flow Matching IROS 2025
Deep generative models, particularly diffusion and flow matching models, have
recently shown remarkable potential in learning complex policies through
imitation learning. However, the safety of generated motions remains
overlooked, particularly in complex environments with inherent obstacles. In
this work, we address this critical gap by proposing Potential Field-Guided
Flow Matching Policy (PF2MP), a novel approach that simultaneously learns task
policies and extracts obstacle-related information, represented as a potential
field, from the same set of successful demonstrations. During inference, PF2MP
modulates the flow matching vector field via the learned potential field,
enabling safe motion generation. By leveraging these complementary fields, our
approach achieves improved safety without compromising task success across
diverse environments, such as navigation tasks and robotic manipulation
scenarios. We evaluate PF2MP in both simulation and real-world settings,
demonstrating its effectiveness in task space and joint space control.
Experimental results demonstrate that PF2MP enhances safety, achieving a
significant reduction of collisions compared to baseline policies. This work
paves the way for safer motion generation in unstructured and obstacle-rich
environments.
comment: 8 pages, 6 figures, Accepted to IROS 2025
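A minimal sketch of the guidance idea the abstract names (modulating a learned flow-matching vector field with the gradient of a learned potential field); the networks, weighting, and Euler integration here are assumptions, not PF2MP's exact inference procedure.

```python
# One guided Euler step: follow the flow-matching velocity while being pushed away
# from high-potential (obstacle) regions via the potential's gradient.
import torch

def guided_step(x, t, flow_net, potential_net, dt=0.02, guidance=1.0):
    """x: (B, D) current sample, t: time in [0, 1]; returns the next sample."""
    x = x.detach().requires_grad_(True)
    v = flow_net(x, t)                       # learned flow-matching velocity
    phi = potential_net(x).sum()             # scalar potential, high near obstacles
    grad_phi, = torch.autograd.grad(phi, x)  # repulsive direction
    return (x + dt * (v - guidance * grad_phi)).detach()

flow = lambda x, t: -x                               # toy velocity field
pot = lambda x: (x ** 2).sum(dim=-1, keepdim=True)   # toy quadratic potential
print(guided_step(torch.randn(4, 7), t=0.5, flow_net=flow, potential_net=pot).shape)
```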
★ OmniVTLA: Vision-Tactile-Language-Action Model with Semantic-Aligned Tactile Sensing
Recent vision-language-action (VLA) models build upon vision-language
foundations, and have achieved promising results and exhibit the possibility of
task generalization in robot manipulation. However, due to the heterogeneity of
tactile sensors and the difficulty of acquiring tactile data, current VLA
models significantly overlook the importance of tactile perception and fail in
contact-rich tasks. To address this issue, this paper proposes OmniVTLA, a
novel architecture involving tactile sensing. Specifically, our contributions
are threefold. First, our OmniVTLA features a dual-path tactile encoder
framework. This framework enhances tactile perception across diverse
vision-based and force-based tactile sensors by using a pretrained vision
transformer (ViT) and a semantically-aligned tactile ViT (SA-ViT). Second, we
introduce ObjTac, a comprehensive force-based tactile dataset capturing
textual, visual, and tactile information for 56 objects across 10 categories.
With 135K tri-modal samples, ObjTac supplements existing visuo-tactile
datasets. Third, leveraging this dataset, we train a semantically-aligned
tactile encoder to learn a unified tactile representation, serving as a better
initialization for OmniVTLA. Real-world experiments demonstrate substantial
improvements over state-of-the-art VLA baselines, achieving a 96.9% success rate
with grippers (21.9% higher than the baseline) and a 100% success rate with
dexterous hands (6.2% higher than the baseline) in pick-and-place tasks. Moreover,
OmniVTLA significantly reduces task completion time and generates smoother
trajectories through tactile sensing compared to existing VLA models.
comment: 15 pages, 7 figures, 8 tables
☆ ZS-Puffin: Design, Modeling and Implementation of an Unmanned Aerial-Aquatic Vehicle with Amphibious Wings IROS 2025
Unmanned aerial-aquatic vehicles (UAAVs) can operate both in the air and
underwater, giving them broad application prospects. Inspired by the
dual-function wings of puffins, we propose a UAAV with amphibious wings to
address the challenge posed by medium differences on the vehicle's propulsion
system. The amphibious wing, redesigned based on a fixed-wing structure,
features a single degree of freedom in pitch and requires no additional
components. It can generate lift in the air and function as a flapping wing for
propulsion underwater, reducing disturbance to marine life and making it
environmentally friendly. Additionally, an artificial central pattern generator
(CPG) is introduced to enhance the smoothness of the flapping motion. This
paper presents the prototype, design details, and practical implementation of
this concept.
comment: Accepted to IROS 2025
☆ Communication Efficient Robotic Mixed Reality with Gaussian Splatting Cross-Layer Optimization
Realizing low-cost communication in robotic mixed reality (RoboMR) systems
presents a challenge, due to the necessity of uploading high-resolution images
through wireless channels. This paper proposes Gaussian splatting (GS) RoboMR
(GSMR), which enables the simulator to opportunistically render a
photo-realistic view from the robot's pose by calling "memory" from a GS
model, thus reducing the need for excessive image uploads. However, the GS
model may involve discrepancies compared to the actual environments. To this
end, a GS cross-layer optimization (GSCLO) framework is further proposed, which
jointly optimizes content switching (i.e., deciding whether or not to upload an
image) and power allocation (i.e., adjusting transmit power to content profiles) across
different frames by minimizing a newly derived GSMR loss function. The GSCLO
problem is addressed by an accelerated penalty optimization (APO) algorithm
that reduces computational complexity by over 10x compared to traditional
branch-and-bound and search algorithms. Moreover, variants of GSCLO are
presented to achieve robust, low-power, and multi-robot GSMR. Extensive
experiments demonstrate that the proposed GSMR paradigm and GSCLO method
achieve significant improvements over existing benchmarks on both wheeled and
legged robots in terms of diverse metrics in various scenarios. For the first
time, it is found that RoboMR can be achieved with ultra-low communication
costs, and that a mixture of data is useful for enhancing GS performance in dynamic
scenarios.
comment: 14 pages, 18 figures, to appear in IEEE Transactions on Cognitive
Communications and Networking
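To illustrate the content-switching half of the problem described above, here is a toy greedy rule: upload the real image only when the predicted GS-vs-real discrepancy is large and the communication budget allows it. This threshold heuristic is an illustration only, not the paper's GSCLO formulation or its APO solver.

```python
# Toy content-switching rule: True = upload the real frame, False = reuse GS "memory".
def content_switching(frame_errors, budget, cost_per_upload=1.0, threshold=0.15):
    """frame_errors: predicted GS-vs-real discrepancy per frame (0..1 scale)."""
    decisions, spent = [], 0.0
    for err in frame_errors:
        upload = err > threshold and spent + cost_per_upload <= budget
        spent += cost_per_upload if upload else 0.0
        decisions.append(upload)
    return decisions

print(content_switching([0.05, 0.30, 0.10, 0.40, 0.20], budget=2.0))
# [False, True, False, True, False]
```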
☆ Autonomous Mobile Plant Watering Robot : A Kinematic Approach
Plants need regular watering in the appropriate amount to thrive and
survive. While agricultural robots exist that can spray water on plants and
crops such as the , they are expensive and have limited mobility and/or
functionality. We introduce a novel autonomous mobile plant watering robot that
uses a 6 degree of freedom (DOF) manipulator, connected to a 4 wheel drive
alloy chassis, to be able to hold a garden hose, recognize and detect plants,
and to water them with the appropriate amount of water by being able to insert
a soil humidity/moisture sensor into the soil. The robot uses a Jetson Nano, an
Arduino microcontroller, and a RealSense camera to perform computer vision,
detecting plants with real-time YOLOv5 and the Pl@ntNet-300K dataset. The robot
uses LIDAR for object and collision avoidance, does not need to move on a
pre-defined path, and can keep track of which plants it has watered. We provide
the Denavit-Hartenberg (DH) table, forward kinematics, differential drive
kinematics, and inverse kinematics, along with simulation and experiment results.
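Since the abstract mentions a DH table and forward kinematics, here is a generic forward-kinematics sketch from standard DH parameters; the 6-DOF parameter values below are placeholders, not the robot's actual DH table.

```python
# Standard Denavit-Hartenberg forward kinematics: chain one homogeneous transform
# per joint to obtain the end-effector pose in the base frame.
import numpy as np

def dh_transform(theta, d, a, alpha):
    ct, st, ca, sa = np.cos(theta), np.sin(theta), np.cos(alpha), np.sin(alpha)
    return np.array([[ct, -st * ca,  st * sa, a * ct],
                     [st,  ct * ca, -ct * sa, a * st],
                     [0.0,      sa,       ca,      d],
                     [0.0,     0.0,      0.0,    1.0]])

def forward_kinematics(joint_angles, dh_table):
    """dh_table rows: (d, a, alpha); joint_angles supplies one theta per row."""
    T = np.eye(4)
    for theta, (d, a, alpha) in zip(joint_angles, dh_table):
        T = T @ dh_transform(theta, d, a, alpha)
    return T  # 4x4 end-effector pose in the base frame

dh_table = [(0.1, 0.0, np.pi / 2)] * 6          # placeholder 6-DOF parameters
print(forward_kinematics([0.0] * 6, dh_table)[:3, 3])
```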
☆ Developing a Calibrated Physics-Based Digital Twin for Construction Vehicles
This paper presents the development of a calibrated digital twin of a wheel
loader. A calibrated digital twin integrates a construction vehicle with a
high-fidelity digital model allowing for automated diagnostics and optimization
of operations as well as pre-planning simulations enhancing automation
capabilities. The high-fidelity digital model is a virtual twin of the physical
wheel loader. It uses a physics-based multibody dynamic model of the wheel
loader in the software AGX Dynamics. Interactions of the wheel loader's bucket
while in use in construction can be simulated in the virtual model. Calibration
makes this simulation high-fidelity, which can enhance realistic planning for the
automation of construction operations. In this work, a wheel loader was
instrumented with several sensors used to calibrate the digital model. The
calibrated digital twin was able to estimate the magnitude of the forces on the
bucket base with high accuracy, providing a high-fidelity simulation.
☆ DeepFleet: Multi-Agent Foundation Models for Mobile Robots
Ameya Agaskar, Sriram Siva, William Pickering, Kyle O'Brien, Charles Kekeh, Ang Li, Brianna Gallo Sarker, Alicia Chua, Mayur Nemade, Charun Thattai, Jiaming Di, Isaac Iyengar, Ramya Dharoor, Dino Kirouani, Jimmy Erskine, Tamir Hegazy, Scott Niekum, Usman A. Khan, Federico Pecora, Joseph W. Durham
We introduce DeepFleet, a suite of foundation models designed to support
coordination and planning for large-scale mobile robot fleets. These models are
trained on fleet movement data, including robot positions, goals, and
interactions, from hundreds of thousands of robots in Amazon warehouses
worldwide. DeepFleet consists of four architectures that each embody a distinct
inductive bias and collectively explore key points in the design space for
multi-agent foundation models: the robot-centric (RC) model is an
autoregressive decision transformer operating on neighborhoods of individual
robots; the robot-floor (RF) model uses a transformer with cross-attention
between robots and the warehouse floor; the image-floor (IF) model applies
convolutional encoding to a multi-channel image representation of the full
fleet; and the graph-floor (GF) model combines temporal attention with graph
neural networks for spatial relationships. In this paper, we describe these
models and present our evaluation of the impact of these design choices on
prediction task performance. We find that the robot-centric and graph-floor
models, which both use asynchronous robot state updates and incorporate the
localized structure of robot interactions, show the most promise. We also
present experiments that show that these two models can make effective use of
larger warehouse operation datasets as the models are scaled up.
comment: 25 pages, 10 figures, 2 tables
♻ ☆ BeyondMimic: From Motion Tracking to Versatile Humanoid Control via Guided Diffusion
Learning skills from human motions offers a promising path toward
generalizable policies for whole-body humanoid control, yet two key
cornerstones are missing: (1) a high-quality motion tracking framework that
faithfully transforms large-scale kinematic references into robust and
extremely dynamic motions on real hardware, and (2) a distillation approach
that can effectively learn these motion primitives and compose them to solve
downstream tasks. We address these gaps with BeyondMimic, a real-world
framework to learn from human motions for versatile and naturalistic humanoid
control via guided diffusion. Our framework provides a motion tracking pipeline
capable of challenging skills such as jumping spins, sprinting, and cartwheels
with state-of-the-art motion quality. Moving beyond mimicking existing motions
to synthesizing novel ones, we further introduce a unified diffusion policy that
enables zero-shot task-specific control at test time using simple cost
functions. Deployed on hardware, BeyondMimic performs diverse tasks at test
time, including waypoint navigation, joystick teleoperation, and obstacle
avoidance, bridging sim-to-real motion tracking and flexible synthesis of human
motion primitives for whole-body control. https://beyondmimic.github.io/.
comment: fix footnote and math
♻ ☆ Touch and Tell: Multimodal Decoding of Human Emotions and Social Gestures for Robots
Human emotions are complex and can be conveyed through nuanced touch
gestures. Previous research has primarily focused on how humans recognize
emotions through touch or on identifying key features of emotional expression
for robots. However, there is a gap in understanding how reliably these
emotions and gestures can be communicated to robots via touch and interpreted
using data-driven methods. This study investigates the consistency and
distinguishability of emotional and gestural expressions through touch and
sound. To this end, we integrated a custom piezoresistive pressure sensor as
well as a microphone on a social robot. Twenty-eight participants first
conveyed ten different emotions to the robot using spontaneous touch gestures,
then they performed six predefined social touch gestures. Our findings reveal
statistically significant consistency in both emotion and gesture expression
among participants. However, some emotions exhibited low intraclass correlation
values, and certain emotions with similar levels of arousal or valence did not
show significant differences in their conveyance. To investigate emotion and
social gesture decoding within affective human-robot tactile interaction, we
developed single-modality models and multimodal models integrating tactile and
auditory features. A support vector machine (SVM) model trained on multimodal
features achieved the highest accuracy for classifying ten emotions, reaching
40%. For gesture classification, a Convolutional Neural Network-Long Short-Term
Memory network (CNN-LSTM) achieved 90.74% accuracy. Our results
demonstrate that even though the unimodal models have the potential to decode
emotions and touch gestures, the multimodal integration of touch and sound
significantly outperforms unimodal approaches, enhancing the decoding of both
emotions and gestures.
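The multimodal classifier reported above follows a simple fuse-then-classify pattern. Below is an illustrative sketch with an SVM over concatenated tactile and audio features; the feature dimensions and random placeholder data are assumptions, only the pattern is shown.

```python
# Sketch of a multimodal SVM: concatenate tactile and audio feature vectors, then
# standardize and classify. Feature extraction is replaced by random placeholders.
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

rng = np.random.default_rng(0)
n, tactile_dim, audio_dim = 280, 64, 40              # e.g. 28 participants x 10 emotions
X = np.hstack([rng.normal(size=(n, tactile_dim)),    # tactile features
               rng.normal(size=(n, audio_dim))])     # audio features
y = rng.integers(0, 10, size=n)                      # 10 emotion classes

clf = make_pipeline(StandardScaler(), SVC(kernel="rbf", C=1.0))
clf.fit(X[:200], y[:200])
print("held-out accuracy:", clf.score(X[200:], y[200:]))
```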
♻ ★ Joint State and Noise Covariance Estimation
This paper tackles the problem of jointly estimating the noise covariance
matrix alongside states (parameters such as poses and points) from measurements
corrupted by Gaussian noise and, if available, prior information. In such
settings, the noise covariance matrix determines the weights assigned to
individual measurements in the least squares problem. We show that the joint
problem exhibits a convex structure and provide a full characterization of the
optimal noise covariance estimate (with analytical solutions) within joint
maximum a posteriori and likelihood frameworks and several variants. Leveraging
this theoretical result, we propose two novel algorithms that jointly estimate
the primary parameters and the noise covariance matrix. Our BCD algorithm can
be easily integrated into existing nonlinear least squares solvers, with
negligible per-iteration computational overhead. To validate our approach, we
conduct extensive experiments across diverse scenarios and offer practical
insights into their application in robotics and computer vision estimation
problems with a particular focus on SLAM.
comment: Adds a missing related work [4]
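To give a feel for the joint estimation problem described above, here is a generic alternating (block-coordinate) toy in a linear-Gaussian setting: solve weighted least squares with the current covariance, then re-estimate the covariance from residuals. This is an illustration in the spirit of the BCD algorithm mentioned in the abstract, not the paper's analytical MAP update.

```python
# Alternate between (1) states given the noise covariance and (2) covariance given
# the states, on a toy linear-Gaussian problem with block measurements.
import numpy as np

def joint_state_covariance(A, y_blocks, iters=20):
    """A: (m, k, n) stacked Jacobians, y_blocks: (m, k) stacked measurements."""
    m, k, n = A.shape
    Sigma, x = np.eye(k), np.zeros(n)
    for _ in range(iters):
        W = np.linalg.inv(Sigma)                       # block 1: states given Sigma
        H = sum(A[i].T @ W @ A[i] for i in range(m))
        b = sum(A[i].T @ W @ y_blocks[i] for i in range(m))
        x = np.linalg.solve(H, b)
        r = y_blocks - A @ x                           # block 2: Sigma given states
        Sigma = (r[..., None] @ r[:, None, :]).mean(axis=0) + 1e-6 * np.eye(k)
    return x, Sigma

rng = np.random.default_rng(1)
A = rng.normal(size=(200, 2, 3))
x_true = np.array([1.0, -2.0, 0.5])
noise = rng.multivariate_normal([0, 0], [[0.5, 0.1], [0.1, 0.2]], size=200)
x_hat, Sigma_hat = joint_state_covariance(A, A @ x_true + noise)
print(x_hat, Sigma_hat, sep="\n")
```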
♻ ☆ LM-MCVT: A Lightweight Multi-modal Multi-view Convolutional-Vision Transformer Approach for 3D Object Recognition
In human-centered environments such as restaurants, homes, and warehouses,
robots often face challenges in accurately recognizing 3D objects. These
challenges stem from the complexity and variability of these environments,
including diverse object shapes. In this paper, we propose a novel Lightweight
Multi-modal Multi-view Convolutional-Vision Transformer network (LM-MCVT) to
enhance 3D object recognition in robotic applications. Our approach leverages
the Globally Entropy-based Embeddings Fusion (GEEF) method to integrate
multi-views efficiently. The LM-MCVT architecture incorporates pre- and
mid-level convolutional encoders and local and global transformers to enhance
feature extraction and recognition accuracy. We evaluate our method on the
synthetic ModelNet40 dataset and achieve a recognition accuracy of 95.6% using
a four-view setup, surpassing existing state-of-the-art methods. To further
validate its effectiveness, we conduct 5-fold cross-validation on the
real-world OmniObject3D dataset using the same configuration. Results
consistently show superior performance, demonstrating the method's robustness
in 3D object recognition across synthetic and real-world 3D data.
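As an illustration of entropy-guided multi-view fusion in the spirit of the "Globally Entropy-based Embeddings Fusion" named above, the sketch below weights views by the (negative) entropy of their predictive distributions so that more confident views dominate; this weighting rule is an assumption, not the paper's exact formulation.

```python
# Entropy-weighted fusion of multi-view embeddings: lower-entropy (more confident)
# views receive larger weights in the fused representation.
import torch
import torch.nn.functional as F

def entropy_weighted_fusion(view_embeddings, view_logits):
    """view_embeddings: (V, B, D), view_logits: (V, B, C) per-view class logits."""
    probs = F.softmax(view_logits, dim=-1)
    entropy = -(probs * probs.clamp_min(1e-9).log()).sum(dim=-1)   # (V, B)
    weights = F.softmax(-entropy, dim=0)                           # low entropy -> high weight
    return (weights.unsqueeze(-1) * view_embeddings).sum(dim=0)    # (B, D)

emb = torch.randn(4, 2, 256)     # 4 views, batch of 2, 256-d embeddings
logits = torch.randn(4, 2, 40)   # e.g. 40 ModelNet40 classes
print(entropy_weighted_fusion(emb, logits).shape)  # torch.Size([2, 256])
```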
♻ ☆ GMF-Drive: Gated Mamba Fusion with Spatial-Aware BEV Representation for End-to-End Autonomous Driving
Diffusion-based models are redefining the state-of-the-art in end-to-end
autonomous driving, yet their performance is increasingly hampered by a
reliance on transformer-based fusion. These architectures face fundamental
limitations: quadratic computational complexity restricts the use of
high-resolution features, and a lack of spatial priors prevents them from
effectively modeling the inherent structure of Bird's Eye View (BEV)
representations. This paper introduces GMF-Drive (Gated Mamba Fusion for
Driving), an end-to-end framework that overcomes these challenges through two
principled innovations. First, we supersede the information-limited
histogram-based LiDAR representation with a geometrically-augmented pillar
format encoding shape descriptors and statistical features, preserving critical
3D geometric details. Second, we propose a novel hierarchical gated mamba
fusion (GM-Fusion) architecture that substitutes an expensive transformer with
a highly efficient, spatially-aware state-space model (SSM). Our core BEV-SSM
leverages directional sequencing and adaptive fusion mechanisms to capture
long-range dependencies with linear complexity, while explicitly respecting the
unique spatial properties of the driving scene. Extensive experiments on the
challenging NAVSIM benchmark demonstrate that GMF-Drive achieves a new
state-of-the-art performance, significantly outperforming DiffusionDrive.
Comprehensive ablation studies validate the efficacy of each component,
demonstrating that task-specific SSMs can surpass a general-purpose transformer
in both performance and efficiency for autonomous driving.
comment: 7 pages, 4 figures
♻ ☆ OSMa-Bench: Evaluating Open Semantic Mapping Under Varying Lighting Conditions
Open Semantic Mapping (OSM) is a key technology in robotic perception,
combining semantic segmentation and SLAM techniques. This paper introduces a
dynamically configurable and highly automated LLM/LVLM-powered pipeline for
evaluating OSM solutions called OSMa-Bench (Open Semantic Mapping Benchmark).
The study focuses on evaluating state-of-the-art semantic mapping algorithms
under varying indoor lighting conditions, a critical challenge in indoor
environments. We introduce a novel dataset with simulated RGB-D sequences and
ground truth 3D reconstructions, facilitating the rigorous analysis of mapping
performance across different lighting conditions. Through experiments on
leading models such as ConceptGraphs, BBQ and OpenScene, we evaluate the
semantic fidelity of object recognition and segmentation. Additionally, we
introduce a Scene Graph evaluation method to analyze the ability of models to
interpret semantic structure. The results provide insights into the robustness
of these models, forming future research directions for developing resilient
and adaptable robotic systems. Project page is available at
https://be2rlab.github.io/OSMa-Bench/.
comment: Project page: https://be2rlab.github.io/OSMa-Bench/
♻ ☆ Frequency Point Game Environment for UAVs via Expert Knowledge and Large Language Model
Unmanned Aerial Vehicles (UAVs) have made significant advancements in
communication stability and security through techniques such as frequency
hopping, signal spreading, and adaptive interference suppression. However,
challenges remain in modeling spectrum competition, integrating expert
knowledge, and predicting opponent behavior. To address these issues, we
propose UAV-FPG (Unmanned Aerial Vehicle - Frequency Point Game), a
game-theoretic environment model that simulates the dynamic interaction between
interference and anti-interference strategies of opponent and ally UAVs in
communication frequency bands. The model incorporates a prior expert knowledge
base to optimize frequency selection and employs large language models for path
planning, simulating a "strong adversary". Experimental results highlight the
effectiveness of integrating the expert knowledge base and the large language
model, with the latter significantly improving path planning in dynamic
scenarios through iterative interactions, outperforming fixed-path strategies.
UAV-FPG provides a robust platform for advancing anti-jamming strategies and
intelligent decision-making in UAV communication systems.
♻ ☆ Gait in Eight: Efficient On-Robot Learning for Omnidirectional Quadruped Locomotion
On-robot Reinforcement Learning is a promising approach to train
embodiment-aware policies for legged robots. However, the computational
constraints of real-time learning on robots pose a significant challenge. We
present a framework for efficiently learning quadruped locomotion in just 8
minutes of raw real-time training utilizing the sample efficiency and minimal
computational overhead of the new off-policy algorithm CrossQ. We investigate
two control architectures: Predicting joint target positions for agile,
high-speed locomotion and Central Pattern Generators for stable, natural gaits.
While prior work focused on learning simple forward gaits, our framework
extends on-robot learning to omnidirectional locomotion. We demonstrate the
robustness of our approach in different indoor and outdoor environments.
♻ ☆ What Foundation Models can Bring for Robot Learning in Manipulation : A Survey
Dingzhe Li, Yixiang Jin, Yuhao Sun, Yong A, Hongze Yu, Jun Shi, Xiaoshuai Hao, Peng Hao, Huaping Liu, Fuchun Sun, Jianwei Zhang, Bin Fang
The realization of universal robots is an ultimate goal of researchers.
However, a key hurdle in achieving this goal lies in the robots' ability to
manipulate objects in their unstructured surrounding environments according to
different tasks. The learning-based approach is considered an effective way to
address generalization. The impressive performance of foundation models in the
fields of computer vision and natural language suggests the potential of
embedding foundation models into manipulation tasks as a viable path toward
achieving general manipulation capability. However, we believe achieving
general manipulation capability requires an overarching framework akin to that of
autonomous driving. This framework should encompass multiple functional modules, with
different foundation models assuming distinct roles in facilitating general
manipulation capability. This survey focuses on the contributions of foundation
models to robot learning for manipulation. We propose a comprehensive framework
and detail how foundation models can address challenges in each module of the
framework. What's more, we examine current approaches, outline challenges,
suggest future research directions, and identify potential risks associated
with integrating foundation models into this domain.
♻ ☆ Edge-Based Multimodal Sensor Data Fusion with Vision Language Models (VLMs) for Real-time Autonomous Vehicle Accident Avoidance
Autonomous driving (AD) systems relying solely on onboard sensors may fail to
detect distant or obstacle hazards, potentially causing preventable collisions;
however, existing transformer-based Vehicle-to-Everything (V2X) approaches,
which mitigate AD sensing limitations, either lack effective multimodal fusion
and reasoning or struggle to meet real-time performance requirements under
complex, high-dimensional traffic conditions. This paper proposes the Real-time
Edge-based Autonomous Co-pilot Trajectory planner (REACT), a V2X-integrated
trajectory optimization framework for AD based on a fine-tuned lightweight
Vision-Language Model (VLM). REACT integrates infrastructure-provided hazard
alerts with onboard sensor data, capturing intricate surrounding traffic
dynamics and vehicle intents through visual embeddings, interpreting precise
numerical data from symbolic inputs, and employing contextual reasoning to
generate optimized, safety-oriented trajectories. To ensure robust real-time
deployment on edge devices, REACT innovatively employs Residual Trajectory
Fusion (RTF) design and specialized edge-adaptation strategies to reduce model
complexity and improve inference efficiency. Evaluated on the DeepAccident
benchmark, REACT achieves state-of-the-art performance, a 77% collision rate
reduction, a 48.2% Video Panoptic Quality (VPQ), and a 0.57-second inference
latency on the Jetson AGX Orin. Ablation studies validate the contribution of
each input, module, and edge adaptation strategy. These results highlight the
effectiveness of lightweight VLMs in enabling real-time cooperative planning on
edge platforms and underscore the potential of language-guided contextual
reasoning for improving traffic safety and responsiveness.
comment: 24 pages, 6 tables, 7 figures
♻ ☆ UnrealZoo: Enriching Photo-realistic Virtual Worlds for Embodied AI ICCV 2025
We introduce UnrealZoo, a collection of over 100 photo-realistic 3D virtual
worlds built on Unreal Engine, designed to reflect the complexity and
variability of open-world environments. We also provide a rich variety of
playable entities, including humans, animals, robots, and vehicles for embodied
AI research. We extend UnrealCV with optimized APIs and tools for data
collection, environment augmentation, distributed training, and benchmarking.
These extensions yield significant improvements in the efficiency of
rendering and communication, enabling advanced applications such as multi-agent
interactions. Our experimental evaluation across visual navigation and tracking
tasks reveals two key insights: 1) environmental diversity provides substantial
benefits for developing generalizable reinforcement learning (RL) agents, and
2) current embodied agents face persistent challenges in open-world scenarios,
including navigation in unstructured terrain, adaptation to unseen
morphologies, and managing latency in closed-loop control systems when
interacting with highly dynamic objects. UnrealZoo thus serves as both a
comprehensive testing ground and a pathway toward developing more capable
embodied AI systems for real-world deployment.
comment: ICCV 2025 (Highlight), Project page: http://unrealzoo.site/
♻ ☆ ReNiL: Relative Neural Inertial Locator with Any-Scale Bayesian Inference
Pedestrian inertial localization is key for mobile and IoT services because
it provides infrastructure-free positioning. Yet most learning-based methods
depend on fixed sliding-window integration, struggle to adapt to diverse motion
scales and cadences, and yield inconsistent uncertainty, limiting real-world
use. We present ReNiL, a Bayesian deep-learning framework for accurate,
efficient, and uncertainty-aware pedestrian localization. ReNiL introduces
Inertial Positioning Demand Points (IPDPs) to estimate motion at contextually
meaningful waypoints instead of dense tracking, and supports inference on IMU
sequences at any scale so cadence can match application needs. It couples a
motion-aware orientation filter with an Any-Scale Laplace Estimator (ASLE), a
dual-task network that blends patch-based self-supervision with Bayesian
regression. By modeling displacements with a Laplace distribution, ReNiL
provides homogeneous Euclidean uncertainty that integrates cleanly with other
sensors. A Bayesian inference chain links successive IPDPs into consistent
trajectories. On RoNIN-ds and a new WUDataset covering indoor and outdoor
motion from 28 participants, ReNiL achieves state-of-the-art displacement
accuracy and uncertainty consistency, outperforming TLIO, CTIN, iMoT, and RoNIN
variants while reducing computation. Application studies further show
robustness and practicality for mobile and IoT localization, making ReNiL a
scalable, uncertainty-aware foundation for next-generation positioning.
comment: This work has been submitted to the IEEE for possible publication
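The abstract notes that ReNiL models displacements with a Laplace distribution. For a single displacement component d with predicted location μ and scale b, the negative log-likelihood that such a regressor would minimize is

\[
-\log p(d \mid \mu, b) \;=\; \log(2b) \;+\; \frac{|d-\mu|}{b},
\]

so the network reports both an estimate and a spread, and the maximum-likelihood scale on held-out residuals equals their mean absolute error. The paper's exact parameterization (per-axis versus multivariate) is not given in the abstract; this is only the generic per-component form.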
♻ ☆ Speech to Reality: On-Demand Production using Natural Language, 3D Generative AI, and Discrete Robotic Assembly
We present a system that transforms speech into physical objects using 3D
generative AI and discrete robotic assembly. By leveraging natural language
input, the system makes design and manufacturing more accessible to individuals
without expertise in 3D modeling or robotic programming. While current
generative AI models can produce a wide range of 3D digital assets,
AI-generated meshes are not directly suitable for robotic fabrication and do
not account for fabrication constraints. To address this, we contribute a
workflow that integrates natural language processing, 3D generative AI, and
discrete robotic assembly. The system automatically analyzes and modifies
AI-generated geometry to meet physical constraints, such as component count,
overhangs, and connectivity, and produces a feasible robotic assembly sequence
and toolpath. The results are demonstrated through the assembly of various
objects, ranging from chairs to shelves, which are prompted via speech and
realized within 5 minutes using a robotic arm.
comment: This work has been submitted for possible publication. An updated
version will replace this version when available
♻ ☆ Human-Robot Interaction Conversational User Enjoyment Scale (HRI CUES)
Bahar Irfan, Jura Miniota, Sofia Thunberg, Erik Lagerstedt, Sanna Kuoppamäki, Gabriel Skantze, André Pereira
Understanding user enjoyment is crucial in human-robot interaction (HRI), as
it can impact interaction quality and influence user acceptance and long-term
engagement with robots, particularly in the context of conversations with
social robots. However, current assessment methods rely solely on self-reported
questionnaires, failing to capture interaction dynamics. This work introduces
the Human-Robot Interaction Conversational User Enjoyment Scale (HRI CUES), a
novel 5-point scale to assess user enjoyment from an external perspective (e.g.
by an annotator) for conversations with a robot. The scale was developed
through rigorous evaluations and discussions among three annotators with
relevant expertise, using open-domain conversations with a companion robot that
was powered by a large language model, and was applied to each conversation
exchange (i.e. a robot-participant turn pair) alongside overall interaction. It
was evaluated on 25 older adults' interactions with the companion robot,
corresponding to 174 minutes of data, showing moderate to good alignment
between annotators. Although the scale was developed and tested in the context
of older adult interactions with a robot, its basis in general and
non-task-specific indicators of enjoyment supports its broader applicability.
The study further offers insights into understanding the nuances and challenges
of assessing user enjoyment in robot interactions, and provides guidelines on
applying the scale to other domains and populations. The dataset is available
online.
comment: Published in IEEE Transactions on Affective Computing on 18 July
2025. Personal use of this material is permitted. Permission from IEEE must
be obtained for all other uses, in any current or future media
♻ ☆ Anticipating Degradation: A Predictive Approach to Fault Tolerance in Robot Swarms
An active approach to fault tolerance is essential for robot swarms to
achieve long-term autonomy. Previous efforts have focused on responding to
spontaneous electro-mechanical faults and failures. However, many faults occur
gradually over time. Waiting until such faults have manifested as failures
before addressing them is both inefficient and unsustainable in a variety of
scenarios. This work argues that the principles of predictive maintenance, in
which potential faults are resolved before they hinder the operation of the
swarm, offer a promising means of achieving long-term fault tolerance. This is
a novel approach to swarm fault tolerance, which is shown to give a comparable
or improved performance when tested against a reactive approach in almost all
cases tested.
♻ ☆ Context-based Motion Retrieval using Open Vocabulary Methods for Autonomous Driving
Autonomous driving systems must operate reliably in safety-critical
scenarios, particularly those involving unusual or complex behavior by
Vulnerable Road Users (VRUs). Identifying these edge cases in driving datasets
is essential for robust evaluation and generalization, but retrieving such rare
human behavior scenarios within the long tail of large-scale datasets is
challenging. To support targeted evaluation of autonomous driving systems in
diverse, human-centered scenarios, we propose a novel context-aware motion
retrieval framework. Our method combines Skinned Multi-Person Linear
(SMPL)-based motion sequences and corresponding video frames before encoding
them into a shared multimodal embedding space aligned with natural language.
Our approach enables the scalable retrieval of human behavior and their context
through text queries. This work also introduces our dataset WayMoCo, an
extension of the Waymo Open Dataset. It contains automatically labeled motion
and scene context descriptions derived from generated pseudo-ground-truth SMPL
sequences and corresponding image data. Our approach outperforms
state-of-the-art models by up to 27.5% accuracy in motion-context retrieval,
when evaluated on the WayMoCo dataset.
comment: Project page: https://iv.ee.hm.edu/contextmotionclip/; This work has
been submitted to the IEEE for possible publication
♻ ☆ Multi-Keypoint Affordance Representation for Functional Dexterous Grasping
Fan Yang, Dongsheng Luo, Wenrui Chen, Jiacheng Lin, Junjie Cai, Kailun Yang, Zhiyong Li, Yaonan Wang
Functional dexterous grasping requires precise hand-object interaction, going
beyond simple gripping. Existing affordance-based methods primarily predict
coarse interaction regions and cannot directly constrain the grasping posture,
leading to a disconnection between visual perception and manipulation. To
address this issue, we propose a multi-keypoint affordance representation for
functional dexterous grasping, which directly encodes task-driven grasp
configurations by localizing functional contact points. Our method introduces
Contact-guided Multi-Keypoint Affordance (CMKA), leveraging human grasping
experience images for weak supervision combined with Large Vision Models for
fine affordance feature extraction, achieving generalization while avoiding
manual keypoint annotations. Additionally, we present a Keypoint-based Grasp
matrix Transformation (KGT) method, ensuring spatial consistency between hand
keypoints and object contact points, thus providing a direct link between
visual perception and dexterous grasping actions. Experiments on public
real-world FAH datasets, IsaacGym simulation, and challenging robotic tasks
demonstrate that our method significantly improves affordance localization
accuracy, grasp consistency, and generalization to unseen tools and tasks,
bridging the gap between visual affordance learning and dexterous robotic
manipulation. The source code and demo videos are publicly available at
https://github.com/PopeyePxx/MKA.
comment: Accepted to IEEE Robotics and Automation Letters (RA-L). The source
code and demo videos are publicly available at
https://github.com/PopeyePxx/MKA
♻ ☆ A simulation framework for autonomous lunar construction work
We present a simulation framework for lunar construction work involving
multiple autonomous machines. The framework supports modelling of construction
scenarios and autonomy solutions, execution of the scenarios in simulation, and
analysis of work time and energy consumption throughout the construction
project. The simulations are based on physics-based models for contacting
multibody dynamics and deformable terrain, including vehicle-soil interaction
forces and soil flow in real time. A behaviour tree manages the operational
logic and error handling, which enables the representation of complex
behaviours through a discrete set of simpler tasks in a modular hierarchical
structure. High-level decision-making is separated from lower-level control
algorithms, with the two connected via ROS2. Excavation movements are
controlled through inverse kinematics and tracking controllers. The framework
is tested and demonstrated on two different lunar construction scenarios that
involve an excavator and dump truck with actively controlled articulated
crawlers.
comment: 13 pages, 16 figures
♻ ☆ Hypergraph-based Motion Generation with Multi-modal Interaction Relational Reasoning
The intricate nature of real-world driving environments, characterized by
dynamic and diverse interactions among multiple vehicles and their possible
future states, presents considerable challenges in accurately predicting the
motion states of vehicles and handling the uncertainty inherent in the
predictions. Addressing these challenges requires comprehensive modeling and
reasoning to capture the implicit relations among vehicles and the
corresponding diverse behaviors. This research introduces an integrated
framework for autonomous vehicle (AV) motion prediction to address these
complexities, utilizing a novel Relational Hypergraph Interaction-informed
Neural mOtion generator (RHINO). RHINO leverages hypergraph-based relational
reasoning by integrating a multi-scale hypergraph neural network to model
group-wise interactions among multiple vehicles and their multi-modal driving
behaviors, thereby enhancing motion prediction accuracy and reliability.
Experimental validation using real-world datasets demonstrates the superior
performance of this framework in improving predictive accuracy and fostering
socially aware automated driving in dynamic traffic scenarios. The source code
is publicly available at
https://github.com/keshuw95/RHINO-Hypergraph-Motion-Generation.
♻ ☆ MEReQ: Max-Ent Residual-Q Inverse RL for Sample-Efficient Alignment from Intervention
Yuxin Chen, Chen Tang, Jianglan Wei, Chenran Li, Ran Tian, Xiang Zhang, Wei Zhan, Peter Stone, Masayoshi Tomizuka
Aligning robot behavior with human preferences is crucial for deploying
embodied AI agents in human-centered environments. A promising solution is
interactive imitation learning from human intervention, where a human expert
observes the policy's execution and provides interventions as feedback.
However, existing methods often fail to utilize the prior policy efficiently to
facilitate learning, thus hindering sample efficiency. In this work, we
introduce MEReQ (Maximum-Entropy Residual-Q Inverse Reinforcement Learning),
designed for sample-efficient alignment from human intervention. Instead of
inferring the complete human behavior characteristics, MEReQ infers a residual
reward function that captures the discrepancy between the human expert's and
the prior policy's underlying reward functions. It then employs Residual
Q-Learning (RQL) to align the policy with human preferences using this residual
reward function. Extensive evaluations on simulated and real-world tasks
demonstrate that MEReQ achieves sample-efficient policy alignment from human
intervention.
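A schematic reading of the decomposition the abstract describes: if r_human denotes the reward underlying the human expert's interventions and r_prior the reward the prior policy already optimizes, the residual reward MEReQ infers is

\[
r_{\mathrm{res}}(s,a) \;=\; r_{\mathrm{human}}(s,a) \;-\; r_{\mathrm{prior}}(s,a),
\]

and Residual Q-Learning then learns a correction on top of the prior policy so that the combined policy is aligned with r_human without re-learning behavior the prior policy already captures. The exact objective is in the paper, not the abstract; these symbols are illustrative.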
♻ ☆ IRL-VLA: Training an Vision-Language-Action Policy via Reward World Model
Anqing Jiang, Yu Gao, Yiru Wang, Zhigang Sun, Shuo Wang, Yuwen Heng, Hao Sun, Shichen Tang, Lijuan Zhu, Jinhao Chai, Jijun Wang, Zichong Gu, Hao Jiang, Li Sun
Vision-Language-Action (VLA) models have demonstrated potential in autonomous
driving. However, two critical challenges hinder their development: (1)
Existing VLA architectures are typically based on imitation learning in an
open-loop setup, which tends to capture the recorded behaviors in the dataset,
leading to suboptimal and constrained performance; (2) closed-loop training
relies heavily on high-fidelity sensor simulation, where domain gaps and
computational inefficiencies pose significant barriers. In this paper, we
introduce IRL-VLA, a novel closed-loop Reinforcement Learning framework via an
Inverse Reinforcement Learning reward world model
with a self-built VLA approach. Our framework proceeds in a three-stage
paradigm: In the first stage, we propose a VLA architecture and pretrain the
VLA policy via imitation learning. In the second stage, we construct a
lightweight reward world model via inverse reinforcement learning to enable
efficient closed-loop reward computation. Finally, to further enhance planning
performance, we design specialized reward-world-model-guided reinforcement
learning via PPO (Proximal Policy Optimization) to effectively balance safety
incidents, driving comfort, and traffic efficiency. Our
approach achieves state-of-the-art performance on the NAVSIM v2 end-to-end
driving benchmark and placed 1st runner-up in the CVPR 2025 Autonomous Grand
Challenge. We hope that
our framework will accelerate VLA research in closed-loop autonomous driving.
comment: 9 pages, 2 figures
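To illustrate the kind of trade-off the final reward stage above balances (safety incidents, driving comfort, and traffic efficiency), here is a toy composite reward; the terms and weights are placeholders, not the learned reward world model from the paper.

```python
# Toy scalar reward balancing safety, comfort, and efficiency terms.
def composite_reward(collision, jerk, progress_m, w_safe=10.0, w_comfort=0.1, w_eff=1.0):
    """collision: bool, jerk: comfort proxy, progress_m: meters of forward progress."""
    return -w_safe * float(collision) - w_comfort * abs(jerk) + w_eff * progress_m

print(composite_reward(collision=False, jerk=0.8, progress_m=3.2))  # 3.12
```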