High Dexterity

ACT

Fig 1. Pop lid off container.
ACT model trained on dataset trossen_ai_stationary_pop_lid_06.

One of the goals of the ACT algorithm and the Aloha robot was to solve problems requiring significant dexterity. The ACT algorithm can solve single-task problems by training from scratch, with no robot pre-training. It does, however, have some prior image understanding coming from its ResNet18. Here, we use the ACT algorithm as a baseline, and we find that it often does a pretty good job!

  • Successes:
    Pop lid: This task (Fig 1), learned from trossen_ai_stationary_pop_lid_06, works well, but only if the "takeout" container is positioned carefully on the tabletop! The dataset did not have much position variety, so further experiments are planned. The lid was also very snug, so some crushing was necessary even for a human using only two fingers; further experiments with better containers are planned as well. Still, it does succeed!
    Transfer cube: This task, see Fig 2, for either a 20mm or 40mm cube, was easily learned by ACT from e.g. ANRedlich/trossen_ai_stationary_transfer_20mm_cube_01.
    Pour cup to cup: This task, see Fig 3, was easily learned by ACT from ANRedlich/trossen_ai_stationary_pour_box_05. It works for the same range of cup placements of ~2-3 inches as in the datasets.
    Place lids: The dataset ANRedlich/trossen_ai_stationary_place_lids_04 has many different pot and lid colors and shapes at many locations. ACT did surprisingly well, but was inconsistent. The dataset, however, is small relative to the object variety. See Fig 6 for examples of pi0.5 performing this task.
  • Failures:
    Multiple cube colors, sizes, and orientations: The ACT algorithm did not learn the task in dataset ANRedlich/trossen_ai_stationary_transfer_multi_cube_03. See Fig 7 for examples of pi0 performing this task. It may be that the number of examples needs increasing, but we suspect that there is just too much task variety for ACT.
    Place lids: As mentioned, ACT was inconsistent on this dataset, although more data might improve performance.
    Place beads on a string: See Fig 5, which shows pi0.5 succeeding at this task. An ACT model was trained on this task, but was not able to place the bead on the string. It was, however, able to pick up the string and move it from one gripper to the other.
  • Discussion: ACT seems to work well for tasks with limited task and environmental variety. We are not sure whether this is because our datasets are too small or because this is a fundamental limitation of ACT. ACT might also benefit from policy improvement using human interventions, which improves pi0.5 performance significantly. This will require further experimentation.
Fig 2. Transfer 20mm cube.
ACT model trained on dataset trossen_ai_stationary_transfer_20mm_cube_01.
Fig 3. Pour little red cube from one cup to another.
ACT model trained on dataset trossen_ai_stationary_pour_box_05.

pi0/pi0.5

Fig 4. pi0.5 policy for closing a tie wrap. 1 iteration of policy improvement.
Fig 5. pi0.5 policy for placing a bead on a string. 2 iterations of policy improvement.

pi0 and pi0.5 were designed to reason about robot tasks and respond to multiple prompts. Here, however, we ask only how precisely they can perform single high dexterity tasks.

  • Close tie wrap: Fig 4. A pi0.5 policy was trained for 40K steps on the dataset ANRedlich/trossen_ai_stationary_close_tie_wrap_16, which has 50 examples. This initial policy was never successful at inserting the tie wrap tip into the ratchet head, although it was able to get close about 40% of the time. After one iteration of policy improvement, however, using an additional 15K steps of training, the policy was able to insert the tip into the head 25% of the time! It was also able to get very close 50% of the time! This is a difficult task even for a human teleoperator. It is also surprising that pi0.5 can 'see' the tie wrap tip and line it up with the tie wrap head, since its video resolution is scaled down to 224x224 pixels! It is possible that it is not using vision for this alignment, but we believe it is using vision, because it does not grab the tie wrap at the same spot every time and it does seem to be making small-scale alignment corrections. More experiments are needed.
  • Place beads on a string: Fig 5. A full fine-tune pi0.5 policy was trained on trossen_ai_stationary_place_bead_on_string_10 for 40K steps using a single H100 rented on runpod.io. Training took about 24 hours. This policy was then able to place a small bead on a string about 20% of the time. Doubling the size of the dataset and re-training only improved performance a little. However, two iterations of policy improvement using human interventions gave a big boost in performance: 20% -> 50% for the full task, 20% -> 80% for placing the bead on the string, including cases where it was then dropped, and 50% -> 90% for picking up the bead. This is a difficult task and the initial dataset was small -- 50 examples, 30 sec each -- so it is encouraging that pi0.5 can accomplish this task. As with the tie wrap, this is even more surprising given that the video input to pi0.5 is scaled down to 224x224 pixels. An ACT model trained on the same dataset was not able to accomplish the task!
  • Place lids: Fig 6. The dataset ANRedlich/trossen_ai_stationary_place_lids_04 has 6 lids and 8 pots of multiple colors and shapes at many locations, but is small: 50 episodes, 12 min total. Our first attempt was LoRA training for 20K steps on our local RTX5090, which took 16 hours and gave poor results, so we resumed training for another 20K steps and achieved good results for some of the lid/pot combos, including the small lid in Fig 6a, which requires high accuracy (LoRA not shown). We then trained a new pi0 model using full fine-tuning for 20K steps on a remote H100 PCIe GPU, which took about 12 hours. The results were somewhat improved and overall very encouraging, again given the dataset size vs complexity. The robot with the pi0 model is able to pick up at least 50% of the lids and place and drop them crudely on the pots, and it comes very close to picking up the other 3 lids. We also trained a pi0.5 model for 40K steps, which seems to perform slightly better than pi0, typically picking up 66% of the lids, but sometimes more depending on lid position. For even better results, one iteration of policy improvement using human interventions produced a pi0.5 policy that picks up the lids about 95% of the time. Lid placement is also improved: lids now stay on the pan (although not perfectly aligned) 30% of the time vs 15% before, and the total fraction of lids staying either on or close to on went from 66% to 95%.
  • Multiple cube colors, sizes, and orientations: Fig 7. We trained a pi0 policy on this small (50 examples, 12 min total) but moderately difficult dataset, ANRedlich/trossen_ai_stationary_transfer_multi_cube_03, which ACT had failed to learn. LoRA training was used for 10K steps with batch_size=64, which took about 12 hours on a remote H100 PCIe GPU at runpod.io. (We believe a 20K step run on our local RTX5090 with the default batch_size=32 would give a similar result.) The real robot picked up and transferred blue cubes correctly about 80% of the time (Fig 7a), while it achieved ~50% success with yellow cubes (Fig 7b) and ~30-50% success with green and red cubes. These results are very encouraging given the complexity of the problem and the small number of dataset examples. They are much better than what we achieved with ACT on the same dataset!
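The comparison between the 10K step, batch_size=64 run and a 20K step, batch_size=32 run can be checked with simple arithmetic: both process the same number of training samples. The sketch below is only illustrative. The function names are hypothetical, and the 30 frames-per-second rate used to estimate the dataset size is an assumption, not a figure from these experiments. Equal sample counts also do not guarantee equal results, since learning rate schedules and gradient noise differ between the two settings.

```python
def samples_seen(steps: int, batch_size: int) -> int:
    """Total training samples processed over a run."""
    return steps * batch_size


def passes_over_dataset(steps: int, batch_size: int, num_frames: int) -> float:
    """Approximate number of passes over a dataset of num_frames samples."""
    return samples_seen(steps, batch_size) / num_frames


# The two configurations mentioned in the text.
remote_run = samples_seen(steps=10_000, batch_size=64)
local_run = samples_seen(steps=20_000, batch_size=32)
print(remote_run, local_run)  # 640000 640000

# ASSUMPTION: 30 fps recording, one training sample per frame.
# The text only states that the dataset is about 12 minutes long.
ASSUMED_FPS = 30
dataset_frames = 12 * 60 * ASSUMED_FPS
print(f"{passes_over_dataset(10_000, 64, dataset_frames):.1f} passes over the data")
```

Under that frame-rate assumption, either run cycles through the small dataset many times, which is consistent with the dataset being described as small relative to its complexity.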
Fig 6a. pi0.5 policy after 1 iteration of policy improvement.
Note multiple shapes, materials, positions.
Fig 6b. pi0.5 policy after 1 iteration of policy improvement.
Note multiple shapes, materials, positions.
Fig 6c. pi0.5 policy after 1 iteration of policy improvement.
Note multiple shapes, materials, positions.
Fig 6d. pi0.5 policy after 1 iteration of policy improvement.
Note multiple shapes, materials, positions.
Fig 7a. pi0 LoRA policy.
Note multiple colors, orientations, positions.
Fig 7b. pi0 LoRA policy.
Note multiple colors, orientations, positions.

Discussion

  • pi0 vs ACT: ACT seems to have more difficulty with tasks that have a variety of object types, orientations, and locations. This is evident from the multi-cube and lids-on-pots datasets. pi0 and pi0.5 seem to have an easier time with such datasets, perhaps showing greater scene and object understanding.
  • pi0 vs pi0.5: While both pi0 and pi0.5 seem to show pretty good object understanding, pi0.5 seems to perform better than both pi0 and ACT on tasks requiring precision, such as the bead on a string task in Fig 5.
  • LoRA vs full finetune: Full training seems to be able to learn a greater variety of objects, as is evident from the lids and pots dataset. For example, although pi0-full does not pick up (not shown) the silver lid for the pan in Fig 6b, it gets much closer than pi0-lora which doesn't seem to "see" the metal lid at all and gets confused (not shown).
  • Human interventions: As mentioned above, a big boost in performance can be achieved by adding episodes where a person intervenes when the robot is about to fail while running the current policy. The current policy is then further trained on the combined imitation plus intervention dataset. See our policy improvement experiment for details.
  • Dataset size: Although we are seeing very encouraging results with the above datasets, the results are not perfect. We believe that this is partly due to the small size of the datasets relative to their complexity. They each have 50-100 examples, for a total of 12-24 minutes of data. This compares to 5-100 hours of data for task fine-tuning in pi0.
  • Video resolution: One question is whether the video resolution input to pi0 and pi0.5 is high enough to 'see' the tie wrap tip in Fig 4 and the string in Fig 5. The required input resolution for the pi0 and pi0.5 models is 224x224, which is the resolution expected by PaliGemma. Currently the 640x480 video is resized to 224x224 with padding. To effectively zoom in and increase resolution, and to avoid padding, one solution is to crop the 640x480 images before the resize. See Openpi Experimental Details and also our openpi fork (development branch) for details. Whether this helps or not is still an open question.
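The trade-off between resizing with padding and cropping before the resize can be illustrated with a little geometry. The sketch below is not the code from the openpi fork; the function names are hypothetical, and the square crop sizes (480 and 320 pixels) are assumptions chosen only to show the effect. It computes how many of the 224x224 output pixels carry image content in each case, and how much larger an object appears after cropping.

```python
TARGET = 224  # model input size stated in the text


def pad_resize_geometry(src_w: int, src_h: int, target: int = TARGET) -> dict:
    """Aspect-preserving resize into a target x target square, padding the rest."""
    scale = min(target / src_w, target / src_h)
    new_w, new_h = round(src_w * scale), round(src_h * scale)
    return {
        "scale": scale,                             # output pixels per source pixel
        "content": (new_w, new_h),                  # region holding real image data
        "pad": (target - new_w, target - new_h),    # padded columns, padded rows
    }


def crop_resize_scale(crop_side: int, target: int = TARGET) -> float:
    """Scale factor when a square crop is resized to target x target (no padding)."""
    return target / crop_side


padded = pad_resize_geometry(640, 480)
print(padded["content"], padded["pad"])  # (224, 168) (0, 56)

# ASSUMPTION: hypothetical square crops, not the crop used in the fork.
for side in (480, 320):
    gain = crop_resize_scale(side) / padded["scale"]
    print(f"{side}x{side} crop: objects appear {gain:.2f}x larger than with padding")
```

With padding, 56 of the 224 output rows carry no image data. A crop removes that waste and magnifies the remaining scene, at the cost of discarding whatever lies outside the crop window, so the crop has to be chosen so that the workspace stays in view.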