Vision-Language-Action Model

VLA, VLA model, Vision-Language-Action

Agent Types

Foundations


TL;DR

A class of robotics foundation models that translate visual inputs and natural language instructions directly into low-level robot control commands.

In depth

A Vision-Language-Action (VLA) model bridges the gap between semantic understanding and physical manipulation by mapping what the robot sees, together with a language instruction, directly to robot trajectory outputs. By fine-tuning large vision-language backbones on robot trajectory datasets, these architectures learn to generalize manipulation tasks to unseen environments and novel objects. Instead of relying on rigid, pre-programmed automation scripts, VLAs generate continuous, real-time control commands from open-ended natural language prompts.
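
To make the mapping from perception and language to control commands more concrete, here is a minimal Python sketch of one common way a language-model-style backbone can emit actions: each action dimension is discretized into a fixed number of bins, the model predicts one bin index per dimension, and the indices are decoded back into continuous values. This is an illustration under stated assumptions, not the design of any particular VLA. The 256-bin count, the names `tokenize`, `detokenize`, and `run_episode`, and the `policy`, `get_image`, and `send_command` callables are all hypothetical placeholders.

```python
from typing import Callable, List, Sequence

NUM_BINS = 256  # assumed number of bins per action dimension


def tokenize(action: Sequence[float], low: Sequence[float],
             high: Sequence[float], num_bins: int = NUM_BINS) -> List[int]:
    """Map continuous action values to integer bin indices (one per dimension)."""
    tokens = []
    for a, lo, hi in zip(action, low, high):
        clipped = min(max(a, lo), hi)  # keep the value inside the valid range
        idx = int((clipped - lo) / (hi - lo) * num_bins)
        tokens.append(min(idx, num_bins - 1))  # the upper bound falls in the last bin
    return tokens


def detokenize(tokens: Sequence[int], low: Sequence[float],
               high: Sequence[float], num_bins: int = NUM_BINS) -> List[float]:
    """Map bin indices back to continuous values at the center of each bin."""
    actions = []
    for t, lo, hi in zip(tokens, low, high):
        if not 0 <= t < num_bins:
            raise ValueError(f"token {t} outside [0, {num_bins})")
        width = (hi - lo) / num_bins
        actions.append(lo + (t + 0.5) * width)
    return actions


def run_episode(policy: Callable[[object, str], Sequence[int]],
                get_image: Callable[[], object],
                send_command: Callable[[List[float]], None],
                instruction: str,
                low: Sequence[float], high: Sequence[float],
                max_steps: int) -> int:
    """Closed-loop control: observe, query the policy, decode, act, repeat.

    `policy` stands in for a trained VLA; here it is any callable that
    returns one bin index per action dimension.
    """
    steps = 0
    for _ in range(max_steps):
        tokens = policy(get_image(), instruction)
        send_command(detokenize(tokens, low, high))
        steps += 1
    return steps
```

In this sketch the round trip through `tokenize` and `detokenize` loses at most half a bin width per dimension, which is the precision cost of treating actions as discrete tokens. A real system would replace the stub callables with a camera feed, a trained model, and a robot controller.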

Why this matters for your business

This technology could reshape robotics by enabling generalist robots that navigate, reason, and perform complex manual tasks in dynamic workspaces and households without being reprogrammed for each new task.