Policy Improvement using Human Interventions

DAgger

Fig 1. pi0.5 after human intervention and DAgger.
Both lid pickup and placement improved in this example.
  • Human Intervention: Imitation learning (behavioral cloning) can work surprisingly well for initial policy training, but can result in fragile policies. This is because the distribution of states encountered by real robots includes many states/situations not in the imitation learning dataset, D0. In other words, the robot following the imitation learning policy, p0, may find itself on a trajectory different from any it was trained on. This can cause the robot to fail to accomplish its task. One solution is to have a human intervene when the robot, following p0, is about to fail. The human intervenes by stopping the robot mid-trajectory and then teleoperating the robot. Such episodes with policy rollout plus human intervention can be recorded and saved as a new dataset D1. This new dataset contains states/situations not in the original imitation learning dataset D0, so it can be used to teach the robot how to behave in a greater variety of situations. To improve the initial policy, p0, one can combine D0 and D1 and retrain to obtain a more robust policy p1. This process can then be iterated.
  • DAgger: In our case, we intervene only once per episode, when the robot is about to fail, and then teleoperate to the end of the episode. We train the next policy \(p_{i+1}\) by initializing from the current policy \(p_i\) and training on the combination of all datasets collected so far: \(\bigcup_{j=0}^{i+1} D_j\). This is inspired by the DAgger paper, in which the authors prove the benefit of using the current policy to explore the state space while an expert replaces the current policy's actions in that space with expert actions. In DAgger, a large fraction of actions are replaced in this way, which creates a very rich dataset with correct behavior over a wide range of states/situations. Also, in the original DAgger approach the new policy \(p_{i+1}\) is trained from scratch. In practice, however, it is common to intervene only selectively and to continue training from the previous policy, which is what we do here.
  • Place lid on pan: We applied this approach to improve an imitation learning pi0.5 policy trained on ANRedlich/trossen_ai_stationary_place_lids_04, which has 50 examples created using teleoperation. We then added another 50 examples using human intervention. The combined dataset is ANRedlich/trossen_ai_stationary_place_lids_13. The human corrected two types of error. First, as shown in Fig 2b and 3b, the robot fails to pick up the lid. When the human intervenes just before this failure, the lid is picked up and then placed on the pan. Second, the robot picks up the lid correctly but does not place it well on the pan, as shown in Fig 4b, in which case the human intervenes just before the lid is misplaced. Starting from the original pi0.5 policy, which had been trained for 40K steps, training was restarted and continued for another 5K steps on the combined 100-episode dataset (adding 10K steps instead of 5K was slightly worse). The improved performance is shown in Fig 2a, 3a, and 4a. In one test of multiple lids and pans, the robot went from picking up 66% of lids to picking up 95%. Placement went from 15% to 30%, and lids were placed more closely, though still not perfectly, about 60% of the time. Only one iteration was performed, but more are planned.
  • Place beads on a string: We also applied this approach to improve the performance of our pi0.5 policy trained on the teleoperation dataset ANRedlich/trossen_ai_stationary_place_bead_on_string_10, which contains 50 episodes. Two iterations of policy improvement were performed, adding 50 episodes each iteration, leading to the augmented datasets ..._bead_on_string_14 and ..._bead_on_string_15. The original policy was trained for 40K steps, while iteration 1 added 5K steps and iteration 2 added another 10K steps of training. The improved policy is shown running in Fig 5. In one experiment, the percentage of full task completions went from 20% to 50%. Moreover, the percentage of times the robot was able to place the bead on the string, even if it subsequently failed, improved from 20% to 80%. In addition, there was a large improvement in the percentage of times the robot picked up the bead, going from 50% to 90%.
  • Close tie wrap: A pi0.5 policy was trained for 40K steps on the dataset ANRedlich/trossen_ai_stationary_close_tie_wrap_16, which has 50 examples. This policy was never successful at inserting the tie wrap tip into the ratchet head, although it was able to get close about 40% of the time. One set of 50 examples of human interventions was added to the original dataset to produce the combined dataset ..._close_tie_wrap_17. The original 40K-step policy was trained on the combined dataset for an additional 15K steps, producing an improved policy. The improved policy was then able to insert the tip into the head 25% of the time, see Fig 6! It was also able to get very close an additional 50% of the time!
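The iteration described in the bullets above can be sketched as a short loop. This is a minimal illustration of the bookkeeping only, not the code used for these experiments: `Policy`, `collect_interventions`, `fine_tune`, and `improve` are hypothetical names, the two helper functions are placeholders that do no real recording or training, and datasets are represented as plain lists of episode labels. The step schedule in the usage lines mirrors the bead-on-string numbers above (40K steps, then +5K, then +10K, with 50 new episodes per round).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Policy:
    """Stand-in for a trained checkpoint (hypothetical, for illustration only)."""
    name: str
    total_steps: int   # cumulative number of training steps
    num_episodes: int  # size of the dataset it was last trained on


def collect_interventions(policy: Policy, num_episodes: int) -> list[str]:
    """Placeholder: roll out `policy` and record one human takeover per episode."""
    return [f"{policy.name}_intervention_{k}" for k in range(num_episodes)]


def fine_tune(policy: Policy, dataset: list[str], extra_steps: int, name: str) -> Policy:
    """Placeholder: continue training from `policy` (not from scratch) on `dataset`."""
    return Policy(name, policy.total_steps + extra_steps, len(dataset))


def improve(p0: Policy, d0: list[str], extra_steps_per_round: list[int],
            episodes_per_round: int = 50) -> Policy:
    aggregated = list(d0)  # starts as D_0
    policy = p0
    for i, extra_steps in enumerate(extra_steps_per_round):
        # D_{i+1} is collected with the current policy p_i.
        new_data = collect_interventions(policy, episodes_per_round)
        # Aggregate: the training set is now D_0 through D_{i+1}.
        aggregated += new_data
        # p_{i+1} is initialized from p_i and trained on the aggregate.
        policy = fine_tune(policy, aggregated, extra_steps, name=f"p{i + 1}")
    return policy


d0 = [f"teleop_{k}" for k in range(50)]
p0 = Policy("p0", total_steps=40_000, num_episodes=len(d0))
p2 = improve(p0, d0, extra_steps_per_round=[5_000, 10_000])
print(p2)  # Policy(name='p2', total_steps=55000, num_episodes=150)
```

The key design choice the sketch makes visible is that the dataset only grows: earlier teleoperation and intervention episodes are kept in every later round, so the policy does not forget behavior it already had.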
Fig 2a. pi0.5 after human intervention and DAgger.
Lid is correctly picked up, compare to Fig 2b.
Fig 2b. pi0.5 initial imitation learning only.
Fails to pick up lid.
Fig 3a. pi0.5 after human intervention and DAgger.
Lid is correctly picked up, compare to Fig 3b.
Fig 3b. pi0.5 initial imitation learning only.
Fails to pick up lid.
Fig 4a. pi0.5 after human intervention and DAgger.
Lid placement is improved compared to Fig 4b.
Fig 4b. pi0.5 initial imitation learning only.
Fails to place lid.
Fig 5. pi0.5 after 2 iterations of DAgger.
Policy both picks up and places the bead more reliably.
Fig 6. pi0.5 after 1 iteration of DAgger.
Policy both grabs the tip and inserts it more reliably.

Human Intervention Implementation

Fig 7. Policy rollout with human intervention.

Fig 7 shows our implementation of policy rollout with human intervention. This is implemented in examples/trossen_ai/record.py in our openpi fork, which also saves episodes in the required lerobot dataset format. The script runs the current pi0.5 (or another policy) until the down arrow key is pressed, at which point the robot arm is frozen. In Fig 7, this happens just before the robot attempts to pick up the lid, which it would fail to do. Next, the leader arm is sent to the same position as the frozen follower arm. Once the leader arm is in place, pressing the down arrow key again puts the robot arms into teleoperation mode, and the person completes the episode. Notice in Fig 8 that the recorded dataset video smoothly splices together the rollout and teleoperated trajectories. The record.py script also implements 'early exit', 'rerecord episode', and 'stop recording' as in control_robot.py in lerobot.
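The control flow described above amounts to a three-state machine driven by the down arrow key. The following is a minimal simulation of that flow, not an excerpt from record.py: `Mode`, `next_mode`, and `record_episode` are hypothetical names, key presses are passed in as a list of booleans rather than read from a keyboard, and no robot is involved. It also assumes that no frames are saved while the arm is frozen, which is one simple way to obtain a recording without a pause; the actual script may achieve the smooth splice differently.

```python
from enum import Enum, auto


class Mode(Enum):
    POLICY = auto()  # the policy controls the follower arms
    FROZEN = auto()  # follower is held still while the leader moves to match it
    TELEOP = auto()  # the human drives the follower through the leader


def next_mode(mode: Mode, down_arrow_pressed: bool) -> Mode:
    """Advance the mode on a down arrow press: POLICY -> FROZEN -> TELEOP."""
    if not down_arrow_pressed:
        return mode
    if mode is Mode.POLICY:
        return Mode.FROZEN
    if mode is Mode.FROZEN:
        return Mode.TELEOP
    return mode  # further presses are ignored in this sketch


def record_episode(key_presses: list[bool]) -> list[dict]:
    """Simulate one episode; key_presses[t] is True if the key is pressed at step t."""
    mode = Mode.POLICY
    frames = []
    for t, pressed in enumerate(key_presses):
        mode = next_mode(mode, pressed)
        if mode is Mode.FROZEN:
            continue  # assumption: nothing is recorded while the arm is frozen
        source = "policy" if mode is Mode.POLICY else "human"
        frames.append({"t": t, "source": source})
    return frames


# Policy runs for two steps, is frozen for two, then the human takes over.
frames = record_episode([False, False, True, False, True, False, False])
print([f["source"] for f in frames])
# ['policy', 'policy', 'human', 'human', 'human']
```

Tagging each saved frame with its source, as done here, is an optional extra that would make it possible to tell policy actions from human corrections later.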

Fig 8. Left wrist camera dataset video for above intervention.