Building Production-Ready AI Systems with Kubernetes, PyTorch, and Ray
As AI workloads grow increasingly complex, the need for robust compute infrastructure is more pressing than ever. Technologies like Kubernetes and PyTorch can help organizations build production-ready AI systems that manage this complexity. At KubeCon + CloudNativeCon North America 2025, Robert Nishihara of Anyscale shared insights on how a comprehensive AI compute stack built from Kubernetes, PyTorch, vLLM, and Ray can address the demands of today's AI workloads.
The Role of Ray in AI Workloads
Ray is an open-source framework specifically designed to facilitate the building and scaling of machine learning and Python applications. Developed at Berkeley as part of a reinforcement learning research initiative, Ray orchestrates infrastructure for distributed workloads, making it a critical player in the AI landscape. Notably, Ray has recently become part of the PyTorch Foundation, reinforcing its commitment to contributing to the broader open-source AI ecosystem.
Key Drivers of AI Workload Evolution
Nishihara emphasized three core areas that are driving the evolution of AI workloads: data processing, model training, and model serving.
- Data Processing: The scope of data processing must evolve beyond traditional tabular datasets to accommodate multimodal datasets spanning images, videos, audio, text, and sensor data. This shift is particularly important for supporting inference, a fundamental aspect of AI-powered applications. Hardware infrastructure needs to adapt as well, with growing support for GPUs alongside traditional CPUs. Nishihara pointed out that data processing has transformed from "SQL operations on CPUs" to "inferences on GPUs."
- Model Training: Training now incorporates reinforcement learning (RL) and post-training tasks, such as generating new data through model inference. The Ray Actor API can be used for Trainer and Generator components, creating stateful workers whose method calls are scheduled onto specific instances. Moreover, Ray's native Remote Direct Memory Access (RDMA) support improves performance by allowing GPU objects to be transferred directly between devices.
- Model Serving: With the rise of open-source reinforcement learning frameworks built on Ray, such as Cursor's Composer, organizations now have an array of tools at their disposal. Nishihara highlighted notable frameworks like Verl (ByteDance), OpenRLHF, ROLL (Alibaba), NeMo-RL (NVIDIA), and SkyRL (UC Berkeley), which combine training engines such as Hugging Face's with serving engines like vLLM.
Connecting Applications to Hardware
The architecture surrounding Ray is growing more complex in both the upper and lower layers of the stack. The upper layers consist of AI workloads and training and inference frameworks such as PyTorch and vLLM. The lower layers include hardware such as GPUs and CPUs, along with orchestrators like Kubernetes and Slurm. Distributed compute frameworks like Ray and Spark serve as critical bridges between these tiers, streamlining data ingestion and movement.
The Synergy of Kubernetes and Ray
Kubernetes and Ray complement each other effectively for hosting AI applications: Kubernetes provides container-level isolation, Ray provides process-level isolation, and together they enable both vertical and horizontal autoscaling. Nishihara noted that inference demand is far more dynamic than model training, which makes the ability to shift GPUs between the two stages valuable, a capability greatly enhanced by combining Ray with Kubernetes.
Essential Requirements for AI Platforms
For AI platforms to meet the demands of modern workloads, several core requirements must be addressed. These include:
- Native Multi-Cloud Support: Ensuring flexibility and scalability across various cloud environments.
- Workload Prioritization: Effectively managing GPU reservations to handle varying demands.
- Observability and Tooling: Implementing robust monitoring systems for optimized performance.
- Model and Data Lineage Tracking: Keeping a clear record of changes and updates.
- Governance: Ensuring compliance and management oversight throughout the AI lifecycle.
Finally, observability is crucial at both the container and workload levels, enabling the monitoring of essential metrics like object transfer speeds. In an increasingly competitive landscape, establishing a resilient and dynamic AI compute stack will be pivotal for organizations aiming to stay ahead.

