Tesla backs vision-only approach to autonomy using powerful supercomputer

Tesla CEO Elon Musk has been teasing a neural network training computer called ‘Dojo’ since at least 2019. Musk says Dojo will be able to process vast amounts of video data to achieve vision-only autonomous driving. While Dojo itself is still in development, Tesla today revealed a new supercomputer that will serve as a prototype of what Dojo will ultimately offer.

At the 2021 Conference on Computer Vision and Pattern Recognition on Monday, Tesla’s head of AI, Andrej Karpathy, revealed the company’s new supercomputer that allows the automaker to ditch radar and lidar sensors on self-driving cars in favor of high-quality optical cameras. During his workshop on autonomous driving, Karpathy explained that getting a computer to respond to a new environment the way a human can requires an immense dataset, plus a massively powerful supercomputer to train the company’s neural net-based autonomous driving technology on that dataset. Hence the development of these predecessors to Dojo.

Tesla’s newest-generation supercomputer has 10 petabytes of “hot tier” NVMe storage and runs at 1.6 terabytes per second, according to Karpathy. With 1.8 EFLOPS, he said it might be the fifth most powerful supercomputer in the world, but he conceded later that his team has not yet run the specific benchmark necessary to enter the TOP500 supercomputer rankings.

“That said, if you take the total number of FLOPS it would indeed place somewhere around the fifth spot,” Karpathy told TechCrunch. “The fifth spot is currently occupied by NVIDIA with their Selene cluster, which has a very comparable architecture and similar number of GPUs (4480 vs ours 5760, so a bit less).”
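The figures Karpathy quotes allow a rough back-of-the-envelope check. The sketch below assumes the 1.8 EFLOPS total is spread evenly across all 5,760 GPUs, which the article does not state; the function name is illustrative, and the result is an implied average rather than a measured benchmark.

```python
# Figures quoted in the article.
TOTAL_FLOPS = 1.8e18   # 1.8 EFLOPS for the whole cluster
TESLA_GPUS = 5760      # GPU count Karpathy gave for Tesla's cluster
SELENE_GPUS = 4480     # GPU count Karpathy gave for NVIDIA's Selene


def per_gpu_tflops(total_flops: float, gpu_count: int) -> float:
    """Average throughput per GPU in TFLOPS (assumes an even split)."""
    return total_flops / gpu_count / 1e12


tesla_per_gpu = per_gpu_tflops(TOTAL_FLOPS, TESLA_GPUS)
gpu_ratio = TESLA_GPUS / SELENE_GPUS

print(f"Implied throughput per GPU: {tesla_per_gpu:.1f} TFLOPS")
print(f"Tesla GPU count relative to Selene: {gpu_ratio:.2f}x")
```

Under that assumption, each GPU would account for about 312.5 TFLOPS, and Tesla's cluster would have roughly 1.29 times as many GPUs as Selene.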

Musk has been advocating for a vision-only approach to autonomy for some time, in large part because cameras are faster than radar or lidar. As of May, Tesla Model Y and Model 3 vehicles in North America are being built without radar, relying on cameras and machine learning to support Tesla’s advanced driver assistance system and Autopilot.

When radar and vision disagree, which one do you believe? Vision has much more precision, so better to double down on vision than do sensor fusion.

— Elon Musk (@elonmusk) April 10, 2021

Many autonomous driving companies use lidar and high definition maps, which means they require incredibly detailed maps of the places where they’re operating, including all road lanes and how they connect, traffic lights and more. 

“The approach we take is vision-based, primarily using neural networks that can in principle function anywhere on earth,” said Karpathy in his workshop. 

Replacing a “meat computer,” or rather, a human, with a silicon computer results in lower latencies (better reaction time), 360-degree situational awareness and a fully attentive driver that never checks their Instagram, said Karpathy.

Karpathy shared some scenarios of how Tesla’s supercomputer employs computer vision to correct bad driver behavior, including an emergency braking scenario in which the computer’s object detection kicks in to save a pedestrian from being hit, and a traffic control warning that can identify a yellow light in the distance and send an alert to a driver who hasn’t yet started to slow down.

Tesla vehicles have also already proven a feature called pedal misapplication mitigation, in which the car identifies pedestrians in its path, or even a lack of a driving path, and responds to the driver accidentally stepping on the gas instead of braking, potentially saving pedestrians in front of the vehicle or preventing the driver from accelerating into a river.

Tesla’s supercomputer collects video captured at 36 frames per second by eight cameras that surround the vehicle, which provides insane amounts of information about the environment surrounding the car, Karpathy explained.
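To put that camera setup in numbers, here is a minimal sketch of the frame count it implies. It assumes all eight cameras record at the full 36 frames per second at the same time, which the article does not spell out; the function name is made up for illustration.

```python
CAMERAS = 8   # cameras surrounding the vehicle, per the article
FPS = 36      # frames per second, per the article


def frames_captured(seconds: int, cameras: int = CAMERAS, fps: int = FPS) -> int:
    """Total frames across all cameras over a span of driving time."""
    return seconds * cameras * fps


print(f"Per second: {frames_captured(1):,} frames")
print(f"Per minute: {frames_captured(60):,} frames")
print(f"Per hour:   {frames_captured(3600):,} frames")
```

On those assumptions, a single car would produce 288 frames every second, or just over a million frames per hour of driving.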

While the vision-only approach is more scalable than collecting, building and maintaining high definition maps everywhere in the world, it’s also much more of a challenge, because the neural networks doing the object detection and handling the driving have to be able to collect and process vast quantities of data at speeds that match the depth and velocity recognition capabilities of a human.

Karpathy says after years of research, he believes it can be done by treating the challenge as a supervised learning problem. Engineers testing the tech found they could drive around sparsely populated areas with zero interventions, said Karpathy, but “definitely struggle a lot more in very adversarial environments like San Francisco.” For the system to truly work well and eliminate the need for things like high-definition maps and additional sensors, it’ll have to get much better at dealing with densely populated areas.

One of the Tesla AI team’s game changers has been auto-labeling, through which it can automatically label things like roadway hazards and other objects from millions of videos captured by cameras on Tesla vehicles. Large AI datasets have often required a lot of manual labeling, which is time-consuming, especially when trying to arrive at the kind of cleanly labeled dataset required to make a supervised learning system on a neural network work well.

With this latest supercomputer, Tesla has accumulated 1 million videos of around 10 seconds each and labeled 6 billion objects with depth, velocity and acceleration. All of this takes up a whopping 1.5 petabytes of storage. That seems like a massive amount, but it’ll take a lot more before the company can achieve the kind of reliability it requires out of an automated driving system that relies on vision systems alone, hence the need to continue developing ever more powerful supercomputers in Tesla’s pursuit of more advanced AI.
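For a sense of scale, the dataset figures above can be broken down per clip. This is a rough sketch that assumes decimal petabytes (10^15 bytes), treats each clip as exactly 10 seconds and attributes all 1.5 petabytes to the clips and their labels; none of those details are confirmed in Karpathy's talk as reported here.

```python
# Figures quoted in the article.
VIDEOS = 1_000_000                        # clips collected
CLIP_SECONDS = 10                         # "around 10 seconds each"
LABELED_OBJECTS = 6_000_000_000           # objects labeled
STORAGE_BYTES = 1_500_000_000_000_000     # 1.5 PB, decimal units assumed

bytes_per_clip = STORAGE_BYTES // VIDEOS      # average storage per clip
labels_per_clip = LABELED_OBJECTS // VIDEOS   # average labels per clip
total_hours = VIDEOS * CLIP_SECONDS / 3600    # total footage in hours

print(f"Storage per clip: {bytes_per_clip / 1e9:.1f} GB")
print(f"Labels per clip:  {labels_per_clip:,}")
print(f"Total footage:    {total_hours:,.0f} hours")
```

That works out to roughly 1.5 GB and 6,000 labeled objects per clip, across a little under 2,800 hours of footage.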