Mobile ALOHA
🏖️ How Mobile ALOHA opens up the path to Embodied AI
Post by Phil Butler
We’re one step closer to Embodied AI with the introduction of Mobile ALOHA. Before taking a look at Mobile ALOHA, let’s first recap ALOHA.
ALOHA: A Low-cost Open-source Hardware System for Bimanual Teleoperation (project page)
Tony Z. Zhao, Vikash Kumar, Sergey Levine, Chelsea Finn
Contributions:
The hardware design of an affordable stationary human-controllable robot capable of learning to do tasks autonomously.
This brings the price of a similar system from ~$100k* → <$20k.
Complete with a hardware parts list, tutorial to build, and code to operate
A new specialized learning algorithm & code for imitation learning, Action Chunking with Transformers (ACT) (summary)
Goal: maps camera inputs & joints → action sequences
(details in paper - Section IV & Appendix C)
Mobile ALOHA: the same, but it can move around a room (project page)
Zipeng Fu, Tony Z. Zhao, Chelsea Finn
Contributions:
The design of an affordable mobile human-controllable robot,
bringing the price from >$200k to $32k, complete with a hardware parts list, tutorial to build, and hardware code
Experimentation & learning code for imitation learning using ACT, Diffusion Policy, & VINN
Goal: maps camera inputs & joints → action sequences
Evidence that co-training on data of diverse tasks from different but similar types of robots boosts performance
(explained in Section 4, experiments in 6.1 of paper)
In each project, during teleoperation, training data for autonomy is created:
Inputs: camera feeds and joint positions
Targets: sequences of actions taken
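To make the inputs/targets split concrete, here is a minimal sketch of how one recorded teleoperation episode could be turned into training pairs. The `Step` layout, the `make_training_pairs` name, and the choice to pad the final chunks by repeating the last action are all assumptions for illustration, not the projects' actual data format.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Step:
    """One timestep recorded during teleoperation (hypothetical layout)."""
    images: Sequence[str]    # stand-in for camera frames, e.g. file names
    joints: Sequence[float]  # joint positions at this timestep
    action: Sequence[float]  # action the operator commanded at this timestep


def make_training_pairs(episode: List[Step], chunk_size: int):
    """Pair every observation with the sequence of the next `chunk_size` actions.

    Near the end of the episode there are fewer than `chunk_size` actions left,
    so the chunk is padded by repeating the final action (an assumption made
    here to keep every target the same length).
    """
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    pairs = []
    for t, step in enumerate(episode):
        # Target: the actions taken from timestep t onward.
        chunk = [list(s.action) for s in episode[t:t + chunk_size]]
        while len(chunk) < chunk_size:
            chunk.append(list(episode[-1].action))
        # Input: camera feeds and joint positions at timestep t.
        pairs.append(((step.images, step.joints), chunk))
    return pairs
```

Each pair is `((images, joints), action_sequence)`, which mirrors the "camera inputs & joints → action sequences" goal stated for both projects.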
Why this matters:
Early signs of successful task transfer
Let’s be clear: Mobile ALOHA performs these tasks autonomously only after being trained on those specific tasks (50 demonstrations per task) in the fashion explained above.
However, its performance on those tasks improves when it’s co-trained with the data of different tasks from a different, but similar robot (ALOHA).
But as stated, there are “few to none accessible bimanual mobile manipulation datasets”.
Let’s look at some of their results before getting ahead of ourselves.
Numbers are success rates (#Success / #Attempts). For subtasks, #Attempts is #Success from the previous subtask. Best scores for Whole Task are in bold. Each success rate was computed over 20 trials.
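The counting rule in that caption is easy to misread, so here is a small sketch of it. The function name and the example counts are made up for illustration and are not taken from the paper's table.

```python
from typing import List, Tuple


def chained_success_rates(successes: List[int], trials: int) -> Tuple[List[float], float]:
    """Compute per-subtask and whole-task success rates.

    `successes[i]` is how many runs completed subtask i. The first subtask is
    attempted `trials` times; every later subtask is only attempted by the runs
    that succeeded at the previous one.
    """
    if not successes:
        raise ValueError("need at least one subtask")
    rates = []
    attempts = trials
    for count in successes:
        if count > attempts:
            raise ValueError("a subtask cannot succeed more often than it was attempted")
        rates.append(count / attempts if attempts else 0.0)
        attempts = count  # survivors become the attempts for the next subtask
    # The whole task succeeds only if the last subtask does.
    whole_task = successes[-1] / trials
    return rates, whole_task


# Invented counts: 20 trials, three subtasks.
rates, whole = chained_success_rates([18, 15, 12], trials=20)
```

With these invented counts, the subtask rates are 18/20, 15/18, and 12/15, while the whole-task rate is 12/20. The whole-task rate equals the product of the subtask rates, which is why a policy can look strong on every subtask and still post a modest overall number.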
From the table above, let’s focus on Diffusion Policy (DP) and ACT since they performed the best. For Wipe Wine, co-training boosted success from 35% to 65% for DP and from 50% to 95% for ACT. For Push Chairs, co-training boosted success from 80% to 100% for DP (ACT already succeeded every time).
Keep in mind that the ALOHA datasets do not contain these specific mobile tasks (they consist of stationary tasks like fastening a velcro cable, slotting a battery, etc.), yet co-training on them still boosts performance.
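One simple way to picture co-training is a batch sampler that draws from both datasets. This is a sketch under assumptions: the function name, the `mobile_fraction` parameter, and its 0.5 default are hypothetical, and this post does not specify how the two datasets are actually mixed (see Section 4 of the paper for that).

```python
import random
from typing import List, Optional, Sequence, TypeVar

T = TypeVar("T")


def sample_cotraining_batch(
    mobile_data: Sequence[T],
    static_data: Sequence[T],
    batch_size: int,
    mobile_fraction: float = 0.5,  # assumed mixing ratio, not from the post
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Build one training batch from two demonstration datasets.

    Each example comes from the task-specific mobile dataset with probability
    `mobile_fraction`, otherwise from the larger stationary dataset.
    """
    rng = rng or random.Random()
    batch = []
    for _ in range(batch_size):
        source = mobile_data if rng.random() < mobile_fraction else static_data
        batch.append(rng.choice(source))
    return batch


# Toy stand-ins for real demonstrations.
batch = sample_cotraining_batch(["wipe_wine_demo"], ["velcro_demo", "battery_demo"], batch_size=8)
```

The point of the design is that the small task-specific dataset is not drowned out by the larger one: sampling by dataset rather than by example keeps the target task well represented in every batch.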
Now let’s extrapolate using the metaphor of current large models like LLMs or large diffusion models:
If we can aggregate a much larger dataset with many diverse tasks, then, in the same way that LLMs are pre-trained on the general task of next-token prediction and then fine-tuned on specific tasks like summarization, we will probably have a chance at few-shot replication of complex robotics tasks.
The issue is scaling the collection of this data, which we look at next.
Addressing the hardware cost bottleneck
Bottlenecks of progress towards robots capable of complex tasks
Let’s use a metaphor of the bottlenecks that had to be opened up for the explosion of deep learning.
First, Yann LeCun proposed Convolutional Neural Networks in the ’90s, opening up the bottleneck of a generalizable, performant ML algorithm for image recognition.
The idea received limited attention until it was shown that, when trained on better hardware (GPUs), it could perform extremely well. This opened up the next bottleneck: convincing research labs to spend resources on adopting and improving the technique.
Why do GPUs exist at all? Why can consumers buy one at a reasonable price and slap it into their PC? We have the video game industry to thank for that. Companies like NVIDIA spend large sums of money to produce the best cards they can, as cheaply as they can, so that they can sell as many as they can, resulting in reasonably priced processors.
Lastly, can we get large quantities of high quality training data? For modalities of text, image/video, and speech, the internet has allowed for this data to be abundant. But we can’t say the same for robotic movement.
Enter Mobile ALOHA. Thanks to the open-sourcing of its parts list, design, build tutorial, operation code, and learning code, more research labs can improve its design, build upon the publicly available dataset of tasks, and/or improve upon the learning algorithms.