COBALT: Crowdsourcing Robot Learning via Cloud-Based Teleoperation with Smartphones

The scarcity of large-scale, high-quality demonstration data remains a bottleneck in scaling imitation learning for robotic manipulation. We present Cobalt, a teleoperation platform designed to alleviate this bottleneck and democratize robot learning at scale, both in simulation and in the real world. By leveraging vectorized environments, our scalable, load-balanced infrastructure supports concurrent teleoperation by multiple users on a single GPU. Operators can connect from nearly anywhere on Earth using commonly available devices, including single or dual smartphones, VR headsets, 3D mice, and keyboards. An in-memory data cache and WebRTC video streaming keep control and rendering synchronous, sustaining dozens of concurrent users at 20 Hz with sub-100 ms end-to-end latency. We demonstrate concurrent support for 256 clients across 8 GPUs, underscoring the system's ability to scale both horizontally across hardware and within individual servers. We perform a comprehensive user study showing that phone-based teleoperation performs comparably to or better than specialized hardware, enabling faster, more ergonomic data collection. To ensure data quality, Cobalt logs a suite of real-time metrics to filter suboptimal demonstrations automatically. We further demonstrate that a structured user training curriculum significantly improves task success and downstream behavior cloning performance. Guided by insights from our user study, we crowdsource the collection of a large-scale, high-quality pilot dataset with 7500+ demonstrations (50+ hours) collected with smartphones across nine countries over 5 days. We validate its quality by training state-of-the-art imitation learning algorithms on it.

Cobalt is a scalable, cloud-based data collection platform that enables users worldwide to remotely teleoperate simulated and real robots. By leveraging low-latency networking, diverse input devices, multiple simulation frameworks, and real-world teleoperation capabilities, Cobalt facilitates large-scale crowdsourcing and democratizes the creation of high-quality robotics datasets. At the core of Cobalt is a robust cloud-based architecture designed for scalability and efficiency.

When a user connects to the main Cobalt server, a load balancer routes them to the task group that corresponds to the requested task. Each task group can dynamically provision virtual machines based on demand.
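The routing step described above can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the class names, the per-VM capacity limit, and the least-loaded placement rule are hypothetical, and provisioning is simulated by creating an object rather than calling a cloud API.

```python
from dataclasses import dataclass, field


@dataclass
class VirtualMachine:
    """One provisioned VM; `capacity` is a hypothetical per-VM user limit."""
    name: str
    capacity: int
    users: set = field(default_factory=set)

    def has_room(self) -> bool:
        return len(self.users) < self.capacity


@dataclass
class TaskGroup:
    """All VMs serving one task; grows when every VM is full."""
    task: str
    vm_capacity: int
    vms: list = field(default_factory=list)

    def assign(self, user_id: str) -> VirtualMachine:
        # Prefer the least-loaded VM that still has a free slot.
        candidates = [vm for vm in self.vms if vm.has_room()]
        if candidates:
            vm = min(candidates, key=lambda v: len(v.users))
        else:
            # No free slot: provision a new VM (stand-in for a cloud API call).
            vm = VirtualMachine(f"{self.task}-vm{len(self.vms)}", self.vm_capacity)
            self.vms.append(vm)
        vm.users.add(user_id)
        return vm


class LoadBalancer:
    """Routes each connecting user to the task group for the requested task."""

    def __init__(self, vm_capacity: int):
        self.vm_capacity = vm_capacity
        self.groups = {}

    def route(self, user_id: str, task: str) -> VirtualMachine:
        if task not in self.groups:
            self.groups[task] = TaskGroup(task, self.vm_capacity)
        return self.groups[task].assign(user_id)


lb = LoadBalancer(vm_capacity=2)
for user, task in [("u1", "lift"), ("u2", "lift"), ("u3", "lift"), ("u4", "coffee")]:
    print(user, "->", lb.route(user, task).name)
```

With a capacity of two users per VM, the third user requesting the same task triggers a second VM in that task group, while a different task gets its own group.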

Each virtual machine runs three services connected via Redis: the client session service ingests client data and publishes pose commands, the teleoperation service steps the simulation and generates camera renders, and the media service streams these renders to the user via low-latency WebRTC.
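To make the data flow between the three services concrete, here is a minimal single-process sketch. It is not Cobalt's implementation: an in-memory queue stands in for Redis, a formatted string stands in for a camera render, and the channel names `pose_commands` and `renders`, the message fields, and the function names are all hypothetical.

```python
from collections import defaultdict, deque


class InMemoryBus:
    """Stand-in for Redis: named FIFO channels kept in process memory."""

    def __init__(self):
        self.channels = defaultdict(deque)

    def publish(self, channel, message):
        self.channels[channel].append(message)

    def pop(self, channel):
        queue = self.channels[channel]
        return queue.popleft() if queue else None


def client_session_service(bus, raw_input):
    """Ingest client data and publish a pose command."""
    pose = {"user": raw_input["user"], "xyz": tuple(raw_input["xyz"])}
    bus.publish("pose_commands", pose)


def teleoperation_service(bus, state):
    """Consume one pose command, step a toy simulation, publish a render."""
    command = bus.pop("pose_commands")
    if command is None:
        return
    state[command["user"]] = command["xyz"]  # toy "simulation step"
    frame = f"frame(user={command['user']}, ee={command['xyz']})"
    bus.publish("renders", frame)


def media_service(bus, sent):
    """Forward the next render to the user (WebRTC in the real system)."""
    frame = bus.pop("renders")
    if frame is not None:
        sent.append(frame)


bus, sim_state, delivered = InMemoryBus(), {}, []
client_session_service(bus, {"user": "u1", "xyz": [0.1, 0.2, 0.3]})
teleoperation_service(bus, sim_state)
media_service(bus, delivered)
print(delivered)
```

Because the services only communicate through named channels, each one can be restarted or scaled without the others needing to know, which is the property the modular design relies on.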

This modular, adaptable design keeps Cobalt scalable and efficient. We consistently achieve sub-100 ms end-to-end latency and have demonstrated concurrent support for 256 users across 8 GPUs.
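The sketch below illustrates how a vectorized environment lets many users share one GPU, and works through the arithmetic implied by the reported figures. It is a pure-Python toy under stated assumptions: Python lists stand in for batched GPU arrays, the `BatchedEnv` class and its methods are hypothetical, and the even split of users across GPUs is an assumption rather than a detail stated in the text.

```python
class BatchedEnv:
    """Toy vectorized environment: one step() call advances every slot."""

    def __init__(self, num_slots):
        self.positions = [0.0] * num_slots   # one scalar state per slot
        self.owner = [None] * num_slots      # user occupying each slot

    def claim(self, user_id):
        slot = self.owner.index(None)        # raises ValueError when full
        self.owner[slot] = user_id
        self.positions[slot] = 0.0
        return slot

    def step(self, actions):
        # `actions` maps slot -> delta; idle slots receive a zero action.
        deltas = [actions.get(i, 0.0) for i in range(len(self.positions))]
        self.positions = [p + d for p, d in zip(self.positions, deltas)]
        return list(self.positions)


TOTAL_USERS, NUM_GPUS, CONTROL_HZ = 256, 8, 20
slots_per_gpu = TOTAL_USERS // NUM_GPUS      # assumes an even split: 32
control_period_ms = 1000 / CONTROL_HZ        # 50.0 ms between control ticks

env = BatchedEnv(slots_per_gpu)
a, b = env.claim("alice"), env.claim("bob")
obs = env.step({a: 0.5, b: -0.25})
print(slots_per_gpu, control_period_ms, obs[:3])
```

Under these assumptions each GPU hosts 32 user slots, and a 20 Hz control loop leaves 50 ms per tick, so a sub-100 ms end-to-end latency corresponds to less than two control periods.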

To prepare users for teleoperating simulated robots with Cobalt and to evaluate the effectiveness of different input modalities, we developed a training curriculum consisting of calibration and evaluation tasks in the MuJoCo simulation environment. The calibration tasks familiarize users with basic controls so that they can collect high-quality demonstration data: the position (top left), rotation (bottom left), and pose (top right) tasks cover these fundamental skills. The evaluation tasks add accuracy and precision measurements as well as time limits, and they become progressively harder with each round. The evaluation suite also adds the beam task (bottom right). Trajectories from the evaluation tasks are scored with our metrics to assess user data quality.

We recruited 12 participants for an initial user study. Half of them were assigned the training curriculum prior to data collection, while the other half served as the control group. Each participant used two randomly assigned input devices (chosen from smartphone, VR headset, 3D mouse, keyboard) to generate data for four MimicGen tasks (Three Piece Assembly, Lift, Mug Cleanup, Coffee). In addition to analyzing the trajectories using our metrics, subjective feedback was collected using NASA-TLX surveys and Likert-scale questionnaires. A separate study was conducted with six additional participants to compare dual-smartphone and VR control for bimanual tasks.

We developed a set of metrics to quantitatively evaluate the efficacy of input devices, training curricula, ergonomics, and data utility. Specifically, our metrics include task success rate, completion time, path length, translational/rotational motion jitter, network latency, and network jitter.
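A few of these metrics can be sketched directly from a logged trajectory. The text does not give formulas, so the definitions below are assumptions chosen for illustration: path length is the summed distance between consecutive end-effector positions, translational jitter is the mean norm of the second finite difference of position, and network jitter is the mean absolute change between consecutive latency samples. The function names are hypothetical.

```python
import math


def path_length(positions):
    """Total distance travelled along consecutive positions."""
    return sum(math.dist(p, q) for p, q in zip(positions, positions[1:]))


def completion_time(timestamps):
    """Elapsed time between the first and last logged sample."""
    return timestamps[-1] - timestamps[0]


def translational_jitter(positions):
    """Mean norm of the second finite difference (assumed definition)."""
    second_diffs = [
        # |p0 - 2*p1 + p2| written as the distance between (p0 + p2) and 2*p1
        math.dist([a + c for a, c in zip(p0, p2)], [2 * b for b in p1])
        for p0, p1, p2 in zip(positions, positions[1:], positions[2:])
    ]
    return sum(second_diffs) / len(second_diffs) if second_diffs else 0.0


def network_jitter(latencies_ms):
    """Mean absolute change between consecutive latencies (assumed definition)."""
    diffs = [abs(b - a) for a, b in zip(latencies_ms, latencies_ms[1:])]
    return sum(diffs) / len(diffs) if diffs else 0.0


trajectory = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (1, 1, 2)]
print(path_length(trajectory))
print(translational_jitter(trajectory))
print(network_jitter([40, 50, 45]))
```

Under this jitter definition, motion along a straight line at constant speed scores zero, so larger values indicate abrupt changes in direction or speed.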

We used statistically significant results from the small-scale user study to guide the design of a large-scale, crowdsourced pilot dataset. We created six environments in Isaac Lab that holistically capture the range of skills we believe useful manipulators should exhibit. For example, our tasks were designed to require precision movement, large rotations, and long-horizon planning for successful completion. Successful BC model (ACT or Diffusion Policy) rollouts for each task are shown below.

The Cobalt teleoperation pipeline was validated on a physical Franka Panda arm using the Polymetis control library. We achieved low-latency, real-time control by configuring our system to allow a direct connection to a local Polymetis server. Operators guided the Panda through a standard Lift task and we then trained a BC-RNN policy on this data, confirming that phone-based teleoperation generalizes effectively from simulation to real-world hardware with minimal configuration.

Cobalt can be used to collect data across a variety of both simulated and real-world environments, including bimanual tasks.

Abstract

System Architecture

Training Curriculum

Cobalt features calibration and evaluation tasks to accelerate user onboarding and boost data quality.

User Study

Completing the training curriculum significantly decreases the reset rate during data collection across all input devices.

Metrics

Various metrics demonstrate that smartphones are the best input device for scalable data collection.

Pilot Dataset

Real-World Data Collection

Cobalt can be used to teleoperate physical robots.