SmolVLA: A Vision-Language-Action Model for Affordable and Efficient . . . In this work, we present SmolVLA, a small, efficient, and community-driven VLA that drastically reduces both training and inference costs while retaining competitive performance. SmolVLA is designed to be trained on a single GPU and deployed on consumer-grade GPUs or even CPUs.