Real-World Offline Reinforcement Learning from Vision Language Model Feedback

Sreyas Venkataraman1,3*, Yufei Wang1*, Ziyu Wang2, Navin Sriram Ravie4, Zackory Erickson1†, David Held1†
1The Robotics Institute, Carnegie Mellon University; 2IIIS, Tsinghua University; 3Indian Institute of Technology, Kharagpur; 4Indian Institute of Technology, Madras
*Equal Contribution, †Equal Advising

Combining preference-based reward learning from Vision-Language Models with offline RL for policy learning from unlabeled datasets

Abstract

Offline reinforcement learning can enable policy learning from pre-collected, sub-optimal datasets without online interactions. This makes it ideal for real-world robots and safety-critical scenarios, where collecting online data or expert demonstrations is slow, costly, and risky. However, most existing offline RL works assume the dataset is already labeled with task rewards, a process that often requires significant human effort, especially when ground-truth states are hard to ascertain (e.g., in the real world). In this paper, we build on prior work, specifically RL-VLM-F, and propose a novel system that automatically generates reward labels for offline datasets using preference feedback from a vision-language model and a text description of the task. Our method then learns a policy using offline RL with the reward-labeled dataset. We demonstrate the system's applicability to a complex real-world robot-assisted dressing task, where we first learn a reward function using a vision-language model on a sub-optimal offline dataset, and then use the learned reward with Implicit Q-Learning (IQL) to train an effective dressing policy. Our method also performs well in simulation tasks involving the manipulation of rigid and deformable objects, and significantly outperforms baselines such as behavior cloning and inverse RL. In summary, we propose a new system that enables automatic reward labeling and policy learning from unlabeled, sub-optimal offline datasets.
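For readers unfamiliar with the policy-learning stage, the objectives below sketch standard Implicit Q-Learning (expectile value regression, TD Q-learning, and advantage-weighted policy extraction), with the dataset reward replaced by the learned reward; the notation follows the original IQL formulation and is not specific to this paper.

$$ L_V(\psi) = \mathbb{E}_{(s,a)\sim\mathcal{D}}\left[ L_2^\tau\big(Q_{\hat\theta}(s,a) - V_\psi(s)\big) \right], \qquad L_2^\tau(u) = \left|\tau - \mathbb{1}(u<0)\right| u^2 $$

$$ L_Q(\theta) = \mathbb{E}_{(s,a,s')\sim\mathcal{D}}\left[ \big(\hat r(s) + \gamma V_\psi(s') - Q_\theta(s,a)\big)^2 \right] $$

$$ L_\pi(\phi) = -\,\mathbb{E}_{(s,a)\sim\mathcal{D}}\left[ \exp\big(\beta\,(Q_{\hat\theta}(s,a) - V_\psi(s))\big) \log \pi_\phi(a \mid s) \right] $$

where $\hat r$ is the reward predicted by the learned reward model, $Q_{\hat\theta}$ is a target Q-network, $\tau \in (0,1)$ is the expectile, and $\beta$ is the inverse temperature for advantage-weighted policy extraction.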

System

Methods Figure



Top: Our system, Offline RL-VLM-F, combines RL-VLM-F [1] with offline RL for effective policy learning from unlabeled datasets. Given an unlabeled dataset without rewards, it first samples observation pairs and queries a Vision-Language Model (VLM) for preferences, given a text description of the task. Using the preference labels, it then learns a reward model via preference-based reward learning. The learned reward model is used to label the entire offline dataset. Finally, it performs offline RL on the labeled dataset to learn a policy that solves the task.
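As a concrete illustration of the reward-learning and relabeling steps, here is a minimal PyTorch sketch. The class and function names (RewardNet, preference_loss, relabel_dataset), the MLP architecture, and the state-vector inputs are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F


class RewardNet(nn.Module):
    """Maps an observation to a scalar reward estimate."""

    def __init__(self, obs_dim, hidden=256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs):
        return self.net(obs).squeeze(-1)


def preference_loss(reward_net, obs_a, obs_b, labels):
    """Bradley-Terry style cross-entropy on VLM preference labels.

    labels is a LongTensor: labels[i] = 0 if observation a is preferred,
    1 if observation b is preferred.
    """
    r_a = reward_net(obs_a)
    r_b = reward_net(obs_b)
    logits = torch.stack([r_a, r_b], dim=-1)  # shape (batch, 2)
    return F.cross_entropy(logits, labels)


def relabel_dataset(reward_net, observations):
    """Replace the missing rewards with predictions from the learned reward model."""
    with torch.no_grad():
        return reward_net(observations)
```

The dataset, once relabeled with the predicted rewards, is handed to an off-the-shelf offline RL algorithm (IQL in this work) for policy learning.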

Bottom: We follow the same VLM querying process as in RL-VLM-F. It consists of two stages: the first is an analysis stage that asks the VLM to analyze and compare the two images; the second is the labeling stage, where the VLM generates the preference labels based on its own analysis from the first stage and the task description.
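A minimal sketch of this two-stage querying pattern is shown below. The vlm_chat callable is a hypothetical stand-in for whichever vision-language model API is used, and the prompts paraphrase the analysis and labeling stages rather than reproducing the exact templates from the paper.

```python
def query_preference(vlm_chat, image_1, image_2, task_description):
    """Two-stage VLM preference query: analyze first, then label."""
    # Stage 1: ask the VLM to analyze and compare the two images.
    analysis = vlm_chat(
        images=[image_1, image_2],
        prompt=f"Task: {task_description}\n"
               "Compare the two images and describe which one shows more "
               "progress toward completing the task.",
    )
    # Stage 2: ask the VLM for a discrete label based on its own analysis.
    label_text = vlm_chat(
        images=[image_1, image_2],
        prompt=f"Task: {task_description}\nAnalysis: {analysis}\n"
               "Based on the analysis, answer 0 if the first image is better, "
               "1 if the second image is better, or -1 if they look the same.",
    )
    return int(label_text.strip())
```

The returned label (0 or 1) serves as the preference target for reward learning; pairs the VLM reports as indistinguishable (-1) can simply be discarded.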

Environments

Environments

Quantitative Results

Simulation Results

| Task | Dataset | IQL-GT Reward | Offline RL-VLM-F (Ours) | IQL-Average Reward | Simple Behavioral Cloning | Diffusion Policy | GAIL |
|---|---|---|---|---|---|---|---|
| Open Drawer | Random | 0.99 | 0.91 | 0.19 | 0.04 | 0.08 | 0.03 |
| Open Drawer | Medium | 0.95 | 0.85 | 0.84 | 0.88 | 0.00 | 0.00 |
| Open Drawer | Expert | 0.99 | 0.99 | 1.00 | 1.00 | 1.00 | 0.07 |
| Soccer | Random | 0.30 | 0.23 | 0.11 | 0.11 | 0.13 | 0.01 |
| Soccer | Medium | 0.19 | 0.15 | 0.19 | 0.20 | 0.10 | 0.00 |
| Soccer | Expert | 0.42 | 0.42 | 0.41 | 0.41 | 0.49 | 0.04 |
| Cartpole | Random | -93.54 | -212.87 | -2537.62 | -1816.98 | -2063.65 | -1683.31 |
| Cartpole | Medium | -98.85 | -144.56 | -236.77 | -165.16 | -1875.05 | -431.31 |
| Cartpole | Expert | -83.93 | -78.77 | -81.46 | -79.19 | -1423.32 | -433.64 |
| Straighten Rope | Random | 20.93 | 16.70 | 8.71 | 12.73 | 12.01 | 11.75 |
| Straighten Rope | Medium | 14.34 | 14.56 | 14.15 | 14.17 | 20.27 | 19.88 |
| Straighten Rope | Expert | 20.58 | 20.54 | 20.46 | 20.54 | 20.27 | 12.33 |

Real World Results

Real World Result 1

Dressed ratio of our method and DP3 on the ViperX 300 S arm. As shown, our method achieves higher dressed ratios on all three garments.

Real World Result 2

Dressed ratio of our method and DP3 on the manikin arm. Both methods achieve similarly high performance, as the manikin arm better represents a real person’s arm and holds an easier pose for dressing.

Simulation Experiments

We show below rollouts for the four environments trained on the random dataset and compare our method with the other baselines.

Drawer Task

Drawer - IQL GT Reward

IQL GT Reward

Drawer - IQL Offline RL-VLM-F (Ours)

Offline RL-VLM-F (Ours)

Drawer - IQL Average Reward

IQL Average Reward

Drawer - Behavioral Cloning

Behavioral Cloning (BC)

Drawer - Diffusion Policy

Diffusion Policy

Cartpole Task

Cartpole - IQL GT Reward

IQL GT Reward

Cartpole - IQL Offline RL-VLM-F (Ours)

Offline RL-VLM-F (Ours)

Cartpole - IQL Average Reward

IQL Average Reward

Cartpole - Behavioral Cloning

Behavioral Cloning (BC)

Cartpole - Diffusion Policy

Diffusion Policy

Rope Task

Rope - IQL GT Reward

IQL GT Reward

Rope - IQL Offline RL-VLM-F (Ours)

Offline RL-VLM-F (Ours)

Rope - IQL Average Reward

IQL Average Reward

Rope - Behavioral Cloning

Behavioral Cloning (BC)

Rope - Diffusion Policy

Diffusion Policy

Soccer Task

Soccer - IQL GT Reward

IQL GT Reward

Soccer - IQL Offline RL-VLM-F (Ours)

Offline RL-VLM-F (Ours)

Soccer - IQL Average Reward

IQL Average Reward

Soccer - Behavioral Cloning

Behavioral Cloning (BC)

Soccer - Diffusion Policy

Diffusion Policy

Real-World Experiments

Viper Experiments - Videos

Viper - Gown (Ours)

Viper - Gown (DP3)

Viper - Green Jacket (Ours)

Viper - Green Jacket (DP3)

Viper - Purple Jacket (Ours)

Viper - Purple Jacket (DP3)

Manikin Experiments

Manikin - Gown (Ours)

Manikin - Gown (DP3)

Manikin - Green Jacket (Ours)

Manikin - Green Jacket (DP3)

Manikin - Purple Jacket (Ours)

Manikin - Purple Jacket (DP3)