Policy Improvement using Human Interventions

Results at a Glance

In our experience, imitation learning for high-dexterity tasks hits a wall, and neither larger datasets nor longer training seems to help. Collecting human interventions at failure points during policy rollouts (DAgger-style), followed by some additional training steps, has been surprisingly effective!
After one iteration of policy improvement using human interventions, on a dataset of lids and pans, a pi0.5 policy was able to pick up even small lids 95% of the time. It also placed lids correctly on pans 30% of the time, and got close an additional 65% of the time. This compares to 66% pickup, 15% correct placement, and 50% close placement without policy improvement.
After two iterations of policy improvement, a pi0.5 policy was able to place a bead on a string 50% of the time, compared to 10-20% of the time without policy improvement.
After only one iteration of policy improvement, a pi0.5 policy was able to insert a tie wrap tip 25% of the time, and get close another 50% of the time. Without interventions, the tie wrap tip was never inserted successfully.
For every task, only 10-15K additional training steps on the augmented datasets were needed following 40K steps of initial pi0.5 training. So this approach was very training-efficient, with no need to retrain from scratch!

Results Table

Task	pi0.5	pi0.5 + PI1	pi0.5 + PI2
Tie-wrap inserted	0%	25%	—
Tie-wrap in+close	40%	75%	—
Bead on string	10-20%	10-50%	50%
Bead on+close	50%	90-100%	100%
Place lids on pans	15%	30%	—
Place on+close	65%	95%	—
Pickup lids	66%	95%	—

Approximate single-task success rates on the real Trossen Stationary AI Robot. The "in+close" and "on+close" rows combine episodes where the task was completed with those where the robot got close. The symbol — indicates the combination was not tested. PI1 and PI2 denote 1 and 2 iterations of policy improvement with human interventions.
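As a small illustration of how the "completed" and "completed or close" rows relate, here is a minimal Python sketch of tallying per-episode outcomes. The outcome labels, the function name `success_rates`, and the 20-episode count are all assumptions made for illustration; the number of evaluation trials per task is not stated here.

```python
from collections import Counter

# Hypothetical outcome labels for one evaluation episode.
COMPLETED, CLOSE, FAILED = "completed", "close", "failed"


def success_rates(outcomes):
    """Return (completed %, completed-or-close %) for a list of outcome labels."""
    total = len(outcomes)
    if total == 0:
        raise ValueError("no episodes to score")
    counts = Counter(outcomes)
    completed = 100.0 * counts[COMPLETED] / total
    combined = 100.0 * (counts[COMPLETED] + counts[CLOSE]) / total
    return completed, combined


# Made-up counts chosen only so the rates match the tie-wrap PI1 row
# (25% inserted, 75% inserted or close); not real trial data.
tie_wrap_pi1 = [COMPLETED] * 5 + [CLOSE] * 10 + [FAILED] * 5
print(success_rates(tie_wrap_pi1))  # (25.0, 75.0)
```

The combined rate always includes the completed episodes, which is why the "+close" rows are never lower than the rows above them.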

Resources: openpi, lerobot · Implementation details · Notes & Optimizations · Datasets

DAgger

Fig 1. pi0.5 after human intervention and DAgger.

Both lid pickup and placement improved in this example.

Human Intervention: Imitation learning (behavioral cloning) can work surprisingly well for initial policy training, but can result in fragile policies. This is because the distribution of states encountered by real robots includes many states/situations not in the imitation learning dataset D₀. In other words, the robot following the imitation learning policy, p₀, may find itself on a trajectory different from any it was trained on. This can cause the robot to fail to accomplish its task. One solution is to have a human intervene when the robot, following p₀, is about to fail. The human intervenes by stopping the robot mid-trajectory and then continues by teleoperating the robot. Such episodes with policy rollout plus human intervention can be recorded and saved as a new dataset D₁. This new dataset contains states/situations not in the original imitation learning dataset D₀, so it can be used to teach the robot how to behave in a greater variety of situations. To improve the initial policy, p₀, one can combine D₀ and D₁ and resume training p₀ to obtain a more robust policy p₁. This process can then be iterated.
DAgger: In our case, we intervene only once, when the robot is about to fail, and then teleoperate to the end of the episode. We also train the next policy \(p_{i+1}\) by initializing with the policy \(p_i\) and training on the combination of all datasets collected so far: \(\sum_{j=0}^{i+1} D_j\). This is inspired by the DAgger paper, in which the authors prove the benefit of using the current policy to explore the state space while having an expert replace the current policy's actions in that space with expert actions. In DAgger, a large fraction of actions are replaced in this way, which creates a very rich dataset with correct behavior over a wide range of states/situations. Also, in the original DAgger approach the new policy \(p_{i+1}\) is trained from scratch. However, in practice it is common to intervene only selectively and to continue training starting with the previous policy, which is what we do here.
Place lid on pan, before and after: Figs 2,3,4. We applied this approach to improve an imitation learning pi0.5 policy trained on our place-lids dataset D₀, which has 50 episodes created using teleoperation. We then added another 50 episodes using human interventions, to create a combined dataset D₀+D₁. The human corrected two types of error. First, as shown in Fig 2b and Fig 3b, the robot may fail to pick up the lid. When the human intervenes just before this failure, the lid is picked up and then placed on the pan. Second, the robot picks up the lid correctly but does not place it well on the pan, as shown in Fig 4b, in which case the human intervenes just before the lid is misplaced. Starting from the original pi0.5 policy, which had been trained for 40K steps, training was resumed for another 10K steps on D₀+D₁. The improved performance is shown in Figs 2a, 3a, and 4a. In one test of multiple lids and pans, the robot went from picking up 66% of the lids to picking up 95%. Placement of lids that stay on their pans went from 15% to 30%. Placement of lids that either stay on their pans or that are close to staying on went from 65% to 95%. Only one iteration was performed, but more are planned.
Place beads on a string: Fig 5. We also applied this approach to improve the performance of our pi0.5 policy trained on the teleoperation bead-on-a-string dataset D₀, which contains 50 episodes. To begin, 50 episodes with human interventions were recorded as D₁ to build the combined dataset D₀+D₁. The initial policy, which had been trained for 40K steps, was trained for an additional 5K steps on D₀+D₁. This process was repeated to build D₂ and train for another 10K steps on D₀+D₁+D₂. Two types of interventions were performed: one to help the robot pick up the bead, and a second to help the robot place the bead on the string. The improved policy is shown running in Fig 5. In one experiment, the percentage of full task completions went from 20% to 50%. Moreover, the percentage of times the robot either placed the bead correctly or got close improved from 50% to 100%. In addition, there was a large improvement in the percentage of times the robot picked up the bead, going from 50% to 90%.
Close tie wrap: Fig 6. A pi0.5 policy was trained for 40K steps on our tie-wrap dataset D₀, which has 50 episodes. This policy was never successful at inserting the tie wrap tip into the head, although it was able to get close about 40% of the time. One set of 50 episodes with human interventions was added to build D₀+D₁. The original 40K step policy was trained on D₀+D₁ for an additional 15K steps. Two types of interventions were performed, one to improve how the right gripper grabs the tie wrap, and a second to fix misalignment of the tip just before insertion. The improved policy was then able to insert the tip into the head 25% of the time, see Fig 6! It was able to insert or get close 75% of the time. More iterations are planned.
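The iteration scheme described above (roll out the current policy, record intervention episodes as a new dataset, then resume training on all datasets so far) can be sketched in Python as follows. This is only an outline under stated assumptions: `Policy`, `improve_policy`, `fake_collect`, and `fake_train` are hypothetical names, and the stubs stand in for real data collection and openpi training, which are not shown. The step schedule in the usage example mirrors the bead-on-a-string run (40K initial steps, then 5K and 10K).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Policy:
    # Hypothetical stand-in for a trained policy checkpoint.
    steps_trained: int = 0
    trained_on: List[str] = field(default_factory=list)


def improve_policy(policy, d0, collect_interventions, train, steps_per_iter):
    """Run one policy-improvement iteration per entry in steps_per_iter.

    Each iteration records a new intervention dataset with the current
    policy, then resumes training (warm start, not from scratch) on the
    aggregate of every dataset collected so far.
    """
    datasets = [d0]
    for i, extra_steps in enumerate(steps_per_iter, start=1):
        d_i = collect_interventions(policy, i)  # rollout + human intervention
        datasets.append(d_i)
        policy = train(policy, list(datasets), extra_steps)
    return policy, datasets


# Stubs for illustration only: no robot, no real training.
def fake_collect(policy, i):
    return f"D{i}"


def fake_train(policy, datasets, extra_steps):
    return Policy(policy.steps_trained + extra_steps, datasets)


p0 = Policy(steps_trained=40_000, trained_on=["D0"])
p2, all_data = improve_policy(p0, "D0", fake_collect, fake_train, [5_000, 10_000])
print(p2.steps_trained, p2.trained_on)  # 55000 ['D0', 'D1', 'D2']
```

Passing the previous policy into `train` is what distinguishes this from the original DAgger recipe, where each new policy is fit from scratch on the aggregated data.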

Fig 2a. pi0.5 after human intervention and DAgger.

Lid is correctly picked up, compare to Fig 2b.

Fig 2b. pi0.5 initial imitation learning only.

Fails to pick up lid.

Fig 3a. pi0.5 after human intervention and DAgger.

Lid is correctly picked up, compare to Fig 3b.

Fig 3b. pi0.5 initial imitation learning only.

Fails to pick up lid.

Fig 4a. pi0.5 after human intervention and DAgger.

Lid placement is improved compared to Fig 4b.

Fig 4b. pi0.5 initial imitation learning only.

Fails to place lid.

Fig 5. pi0.5 after 2 iterations of DAgger.

Policy both picks up and places the bead more reliably.

Fig 6. pi0.5 after 1 iteration of DAgger.

Policy both grabs the tip and inserts it more reliably.

Human Intervention Implementation

Fig 7. Policy rollout with human intervention.

Fig 7 shows our implementation of policy rollout with human intervention. This is done by the examples/trossen_ai/record.py script in our openpi fork, which also saves episodes in the required lerobot dataset format. The script runs the current pi0.5 policy (or another policy) until the down arrow key is pressed, at which point the robot arm is frozen. In Fig 7, this happens just before the robot attempts to pick up the lid, which it would fail to do. Next, the leader arm is sent to the same position as the frozen follower arm. Once the leader arm is in place, pressing the down arrow key again puts the robot arms into teleoperation mode, and the person completes the episode. Note in Fig 8 that the recorded dataset video smoothly splices together the rollout and teleoperated trajectories. The record.py script also implements 'early exit', 'rerecord episode', and 'stop recording' as in control_robot.py in lerobot.

Fig 8. Left wrist camera dataset video for above intervention.
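The intervention flow described in this section (policy rollout, freeze on the first down-arrow press, teleoperation on the second press once the leader arm is aligned) can be pictured as a three-state machine. The Python sketch below is an illustration only and is not taken from record.py: `Mode`, `next_mode`, `action_source`, and the `leader_aligned` flag are hypothetical names, and treating alignment as a precondition for entering teleoperation is an assumption.

```python
from enum import Enum, auto


class Mode(Enum):
    ROLLOUT = auto()  # policy actions drive the follower arms
    FROZEN = auto()   # follower holds position while the leader is aligned
    TELEOP = auto()   # leader arms drive the follower arms


def next_mode(mode, down_arrow_pressed, leader_aligned=False):
    """Advance the mode on a down-arrow press; otherwise stay put."""
    if not down_arrow_pressed:
        return mode
    if mode is Mode.ROLLOUT:
        return Mode.FROZEN
    if mode is Mode.FROZEN and leader_aligned:
        return Mode.TELEOP
    return mode


def action_source(mode):
    """Which source supplies the action recorded for this timestep."""
    return {Mode.ROLLOUT: "policy", Mode.FROZEN: "hold", Mode.TELEOP: "leader"}[mode]


# Simulated (key pressed, leader aligned) events for one episode.
events = [(False, False), (True, False), (False, True), (True, True), (False, True)]
mode = Mode.ROLLOUT
sources = []
for pressed, aligned in events:
    mode = next_mode(mode, pressed, aligned)
    sources.append(action_source(mode))
print(sources)  # ['policy', 'hold', 'hold', 'leader', 'leader']
```

Because every timestep is logged regardless of mode, the rollout and teleoperated segments end up in a single continuous episode.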