SkyPilot and Slurm: Bridge HPC and Cloud for AI
Building and managing your AI lifecycle shouldn't be a choice between expensive cloud services and the DIY chaos of open-source tools. That's the philosophy behind Cake AI. We provide a complete, end-to-end environment for everything from data engineering to model training and inference. So, how does it all work? This post pulls back the curtain on our platform architecture. We'll show you how our design choices give you flexibility and control, including how our unique SkyPilot and Slurm integration helps you bridge the gap between on-prem HPC compute and cloud-native Kubernetes workflows.
Holistic AI Lifecycle Management
The Cake platform integrates the entire range of capabilities needed for managing the AI lifecycle, including:
Advanced large language and embedding models
Multi-agent systems
3D parallel training and fine-tuning capabilities
Model monitoring and observability tools
GPU auto-scaling (from zero to thousands of nodes) for both training and inference
Exploratory data analysis (EDA) and AutoML frameworks
Cloud cost monitoring and optimization
PII/PHI anonymization utilities
Built to handle both traditional ML and generative AI workloads, Cake provides centralized management—a “single pane of glass”—to oversee every AI project. Whether you’re fine-tuning access controls, optimizing resource allocation, or managing billing for partner tenants, Cake eliminates complexity while ensuring there’s no added cost burden, or “AI tax,” on Kubernetes, Slurm, or virtual machine (VM) instances.
Deployment Flexibility: VPC and On-Prem
Cake deploys directly into your own virtual private cloud (VPC) or on-premises infrastructure. This ensures no sensitive data ever leaves your environment. With encryption both in transit and at rest, along with robust Kubernetes role-based access controls (RBAC), Cake prioritizes security at every layer.
Every component is authenticated, and platform access is scoped based on user roles, ensuring a least-privilege model. Even the deployment itself adheres to infrastructure-as-code (IaC) principles, where all changes are version-controlled through Git repositories. This gives teams full transparency and control over their infrastructure.
Recipes: Accelerating Common Use Cases
Cake is a unified platform featuring over 100 open-source components, designed to streamline AI development and deployment. Key components include:
Inference engines
Model and data drift detection monitors
Data engineering tools
Vector databases
Fine-tuning frameworks for open-source LLMs
Experiment tracking and model registries
Pipeline orchestration engines
Cloud-based notebooks
To simplify common workflows and MLOps tasks, Cake provides Recipes—pre-assembled infrastructure components, pipelines, and example notebooks. These Recipes accelerate project timelines for use cases such as Advanced RAG, GraphRAG, learning agents, recommendation engines, churn modeling, image segmentation, dynamic pricing, and many more.
Each Recipe includes a fully integrated component stack to support seamless development, testing, and deployment.
Example stack for GraphRAG on the Cake platform

Kubernetes at the Core
Cake is built on Kubernetes, providing a scalable, secure, and resource-efficient foundation capable of managing demanding AI workloads. Cake lets you exploit the advantages of Kubernetes without the complexity: data scientists can deploy models, run experiments, and access resources without requiring Kubernetes expertise.
Kubernetes gives Cake dynamic scaling, workload isolation, and seamless integration with AI tools, creating a robust environment for both development and deployment. Kubernetes also provides self-healing capabilities, allowing Cake to minimize downtime by automatically restarting failed pods and performing rolling updates with minimal disruption. These features ensure reliability, scalability, and cost efficiency in AI operations.
Kubernetes also provides numerous cost and flexibility advantages. For example, Kubernetes makes Cake cloud-agnostic, supporting multi-cloud and on-prem deployments; this versatility would be almost unattainable without Kubernetes. Additionally, Kubernetes delivers substantial cost savings compared to common alternatives because its lightweight container architecture lets workloads share nodes efficiently. Finally, unlike platforms such as SageMaker, Vertex AI, or Databricks, Cake runs on standard cloud VM instances without incurring AI-specific costs, avoiding the "AI tax" (as of this writing, SageMaker instances cost on average 40% more than the equivalent EC2 instances that EKS can use).
Cake Deployment Architecture

Beyond Kubernetes: SkyPilot and Slurm
Cake's SkyPilot integration enhances the platform by enabling seamless execution of non-Kubernetes workloads. SkyPilot facilitates cost-efficient, multi-cloud deployments through a streamlined interface that optimizes resource utilization and automates spot-instance management. Supporting over a dozen cloud and on-prem configurations, it dynamically selects the most cost-effective, high-performing nodes that can satisfy your workload's requirements, whether on AWS, GCP, Azure, Lambda Labs, or others.
Features such as built-in fault tolerance, dynamic scaling, and AI/ML workflow support simplify tasks like distributed training and hyperparameter tuning. SkyPilot also ensures resilience with workload checkpointing and recovery while maintaining reproducibility through consistent configurations, making it an indispensable tool for managing scalable, reliable, and cost-effective AI systems.
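To make that concrete, here is a minimal sketch of what launching a GPU job through SkyPilot's Python API can look like. The training script, accelerator type, and cluster name are placeholders, not prescriptions:

```python
import sky

# Define the work: setup runs once per node, run is the job's entrypoint.
# 'train.py' is a placeholder for your own training script.
task = sky.Task(
    setup="pip install torch",
    run="python train.py",
)
task.set_resources(sky.Resources(accelerators="A100:8", use_spot=True))

# SkyPilot scans the clouds you have enabled and provisions the cheapest
# offering that satisfies the resource request.
sky.launch(task, cluster_name="train-cluster")
```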
Beyond SkyPilot, Cake supports Slurm for deployments in non-cloud environments where data locality or on-premise cost considerations are paramount. For example, Slurm is a powerful option if you need to train models in multiple hospitals and combine them using a federated learning framework (e.g., Flower), because patient data cannot be egressed. This flexibility is crucial for these kinds of specialized use cases.
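As a rough illustration of that pattern, here is a minimal Flower client sketch, assuming each hospital runs one as a local Slurm job. The server address is hypothetical and the weights are toy stand-ins; only model updates, never raw patient data, leave the site:

```python
import flwr as fl
import numpy as np

class HospitalClient(fl.client.NumPyClient):
    """Runs inside one hospital; raw patient data never leaves the site."""

    def __init__(self):
        self.weights = [np.zeros((10, 2))]  # toy stand-in for model weights

    def get_parameters(self, config):
        return self.weights

    def fit(self, parameters, config):
        self.weights = parameters
        # ... local training on on-prem patient data would happen here ...
        return self.weights, 100, {}

    def evaluate(self, parameters, config):
        loss, num_examples = 0.0, 100  # placeholder local evaluation
        return loss, num_examples, {}

# Only weight updates travel to the aggregation server (address hypothetical).
fl.client.start_numpy_client(
    server_address="aggregator.example.com:8080", client=HospitalClient()
)
```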
Cake’s platform can also be configured for seamless access from developer or data scientist laptops, enabling IDEs like VS Code or PyCharm to interact remotely with cloud-based Jupyter, RStudio, or VS Code notebooks. Local testing of Ray training or serving code further enhances the development workflow.
Understanding the landscape: Slurm vs. Kubernetes for AI
When it comes to orchestrating large-scale AI workloads, two names often come up: Slurm and Kubernetes. While both are powerful, they were designed with different philosophies and excel in different areas. Understanding their core strengths is key to building an effective AI infrastructure. For many teams, the choice isn't about picking one over the other, but about finding a way to leverage the best of both worlds without creating unnecessary complexity. This is where a managed platform can make all the difference, by abstracting away the underlying infrastructure so your team can focus on building models.
The case for Slurm: Gang scheduling and dedicated resources
Slurm is a long-standing champion in the world of high-performance computing (HPC). Its biggest strength for AI is something called "gang scheduling." Think of it as an all-or-nothing resource allocation. When you start a massive training job that needs a specific number of GPUs, Slurm ensures all of those resources are secured and dedicated to your job from start to finish. This guarantee is critical for long, expensive training runs where interruptions could mean starting over. It provides the stability and predictability that research-intensive AI work demands, ensuring your job has exclusive access to the hardware it needs.
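Here is an illustrative all-or-nothing request, submitted from Python for consistency with the other sketches in this post; the job name, node counts, and script are hypothetical:

```python
import subprocess

# Gang scheduling in practice: Slurm will not start this job until all
# 4 nodes (8 GPUs each) are free together, then holds them exclusively
# for the job until it finishes.
script = """\
#!/bin/bash
#SBATCH --job-name=llm-pretrain
#SBATCH --nodes=4
#SBATCH --gres=gpu:8
#SBATCH --exclusive
srun python train.py
"""

with open("train.sbatch", "w") as f:
    f.write(script)

subprocess.run(["sbatch", "train.sbatch"], check=True)
```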
The case for Kubernetes: Autoscaling and flexibility
Kubernetes, on the other hand, shines in dynamic, production-focused environments. Its standout feature is its ability to autoscale resources. It can automatically add or remove compute resources like GPUs based on real-time demand, which is incredibly cost-effective. You only pay for what you use. Beyond cost savings, Kubernetes is exceptionally flexible. It can manage a diverse range of tasks—from data processing and model training to model serving—all on a single, unified platform. This versatility makes it a great foundation for building end-to-end AI applications that need to adapt quickly.
The scale of modern AI infrastructure
Here’s the reality: neither Slurm nor Kubernetes was originally built with the unique demands of modern AI in mind. Slurm comes from the world of HPC, and Kubernetes from web-scale application management. While both have been adapted for AI, they each have trade-offs. The ultimate goal for any AI team should be to build and deploy models, not to become experts in managing complex, distributed infrastructure. The ideal platform abstracts this choice away, allowing data scientists to access the right resources for the job without getting bogged down in the operational details of the underlying scheduler.
The challenge of building a hybrid AI infrastructure
Given the distinct advantages of both systems, many organizations try to create a hybrid infrastructure that combines Slurm and Kubernetes. The idea is to use Slurm for heavy-duty training and Kubernetes for everything else. In practice, this is much harder than it sounds. A common approach is to run Slurm *inside* a Kubernetes cluster, but this often leads to significant inefficiencies. For example, this setup can force you to reserve entire compute nodes for Slurm, meaning those expensive GPU resources sit idle when no Slurm jobs are running. This creates a clunky, resource-intensive system that defeats the purpose of a flexible, hybrid environment.
How SkyPilot bridges the gap between Slurm and the cloud
This is precisely the problem that Cake’s integration with SkyPilot solves. SkyPilot acts as a smart abstraction layer that sits on top of your compute resources, whether they are in the cloud, on-prem, or in a Slurm cluster. It provides a single, simple interface for launching jobs, letting you tap into the power of different environments without the operational headache. Instead of wrestling with two separate systems, your team can use one unified tool to run workloads wherever it makes the most sense, whether that’s on a Kubernetes cluster, a fleet of cloud VMs, or an on-prem Slurm cluster.
How the SkyPilot and Slurm integration works
The way SkyPilot integrates with Slurm is elegant in its simplicity. It essentially treats your on-premise Slurm cluster as just another "cloud" provider, right alongside AWS, GCP, and Azure. This means your data scientists can submit a job to the Slurm cluster using the exact same SkyPilot command or configuration file they would use to launch it on a cloud instance. There’s no need to learn Slurm-specific commands or write custom submission scripts. SkyPilot handles the translation, making your on-prem hardware a seamless extension of your cloud resources.
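Here is a hedged sketch of what that looks like with SkyPilot's Python API. The script and cluster names are placeholders, and exactly how the Slurm backend is selected depends on your SkyPilot version and configuration:

```python
import sky

# One task definition -- no Slurm-specific submission scripts.
task = sky.Task(run="python finetune.py")  # script name is a placeholder
task.set_resources(sky.Resources(accelerators="A100:4"))

# Launch on whichever cloud satisfies the request ...
sky.launch(task, cluster_name="finetune-cloud")

# ... or pin the very same task to a Kubernetes cluster.
task.set_resources(sky.Resources(cloud=sky.Kubernetes(), accelerators="A100:4"))
sky.launch(task, cluster_name="finetune-k8s")

# With the Slurm integration enabled, the on-prem cluster is addressed the
# same way, as just another provider in the resource spec, so the task
# definition itself never changes.
```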
Key SkyPilot features for Slurm users
This integration offers some powerful advantages. First, it provides a unified access point. If your organization has multiple Slurm clusters, your team can access all of them through a single, consistent interface without juggling different logins or scripts. Second, it makes your workloads portable. The same SkyPilot YAML file that defines your job's requirements can be used to run that job on your Slurm cluster, a Kubernetes pod, or a standard cloud VM. This removes friction and allows your team to move workloads to the most appropriate environment with minimal effort.
Current limitations to be aware of
To give you a complete picture, it's important to know about a couple of current limitations of the SkyPilot and Slurm integration. As of now, jobs launched on Slurm via SkyPilot do not automatically stop when they become idle, a feature that is available for cloud instances. Additionally, the integration does not yet support custom container images (e.g., Docker images). These are important considerations to keep in mind as you plan your workloads, but for many use cases, the benefits of a unified and portable infrastructure provide a massive advantage.
Infrastructure as Code (IaC) and GitOps
Cake leverages Infrastructure as Code (IaC) principles, integrating Terraform, Crossplane, and ArgoCD to deliver a scalable and consistent platform for managing AI infrastructure. All changes are version-controlled, auditable, and reviewable before deployment.
Terraform handles provisioning and management of core cloud resources, including GPU-enabled nodes and storage, forming the backbone of Cake’s infrastructure. ArgoCD handles Kubernetes application deployment using the same GitOps principles as Terraform, installing the desired Helm charts and Kustomize manifests onto the platform. Finally, Crossplane enables Kubernetes to dynamically provision additional cloud resources as ArgoCD adds or removes open-source applications from the running platform.
This combination of IaC software provides declarative, versioned configurations for AI applications, enhancing auditability, collaboration, and disaster recovery. The approach minimizes manual errors, optimizes resource utilization, and accelerates deployment across the AI lifecycle, from data processing to model training and deployment.
Scaling and Cost Optimization
Cake leverages advanced Kubernetes autoscalers such as Karpenter to dynamically scale training and inference clusters from zero to thousands of nodes, optimizing for specific node types. Need fast GPUs for fine-tuning and cost-effective GPUs for inference? Cake’s autoscalers ensure resources are allocated efficiently.
By continuously monitoring cloud resource usage and storing metrics in Prometheus, Cake provides detailed insights into workload costs. Resource Quotas and Limits help you prevent resource overuse, mitigating unexpected expenses.
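As an example of the kind of guardrail involved, here is a sketch that caps a team namespace's GPU requests using the Kubernetes Python client; the namespace, quota name, and limit are hypothetical:

```python
from kubernetes import client, config

config.load_kube_config()

# Cap the 'team-a' namespace at 8 requested GPUs to bound spend.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="team-a-gpu-quota"),
    spec=client.V1ResourceQuotaSpec(hard={"requests.nvidia.com/gpu": "8"}),
)
client.CoreV1Api().create_namespaced_resource_quota(
    namespace="team-a", body=quota
)
```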
For distributed computing, Cake currently integrates Ray and Spark. Ray supports low-latency, scalable AI workloads such as distributed training, hyperparameter tuning, and real-time inference, while Spark excels at big data preprocessing, ETL, and streaming tasks with Spark SQL and big data ecosystem integration. These tools together enable high-performance, scalable, fault-tolerant workflows, streamlining data processing and model deployment.
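Here is a toy Ray sketch of the fan-out pattern this enables; the work inside each task is a stand-in for real GPU inference or training:

```python
import ray

ray.init()  # connects to the cluster when run inside one; local otherwise

@ray.remote(num_gpus=1)  # each task claims a GPU (drop num_gpus on a laptop)
def embed(batch):
    return [hash(x) % 1000 for x in batch]  # stand-in for real GPU work

batches = [["a", "b"], ["c", "d"], ["e", "f"]]
results = ray.get([embed.remote(b) for b in batches])
print(results)
```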
Cake also supports a range of inference engines and model servers, including vLLM, Triton with TensorRT, Ray Serve, and KServe ModelMesh. These leverage Kubernetes autoscaling to optimize fractional GPU usage, apply DeepSpeed 3D parallelism for cost-effective deployment on cheaper GPUs, handle complex inference graphs, and enable advanced deployment strategies such as canary deployments and A/B testing.
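As a small illustration, here is what generating text through vLLM's offline Python API can look like; the model checkpoint is just an example:

```python
from vllm import LLM, SamplingParams

# Model name is illustrative; any Hugging Face-compatible checkpoint works.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.2, max_tokens=64)

outputs = llm.generate(["Explain canary deployments in one sentence."], params)
print(outputs[0].outputs[0].text)
```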
State-of-the-Art Security Architecture
The Cake platform is designed to meet critical security objectives that align with the interests of all stakeholders:
Maximize productivity by implementing role-based access controls in Kubeflow, ensuring users have the least privilege necessary to access proprietary data.
Minimize risk from unauthorized or accidental changes by requiring appropriate review and testing processes based on the potential impact of changes on production systems.
Enhance observability by providing audit logs and other vital information necessary for effective incident response, non-repudiation, and forensic analysis.
Ensure availability and stability at a level that matches the business criticality of the services provided, maintaining reliable and robust performance.
Cake’s integration of an Envoy gateway, an Istio service mesh, and OAuth 2.0 significantly strengthens Kubernetes’ role-based access control (RBAC) security by enabling robust service-to-service authentication, authorization, and encryption.
Istio, the service mesh, efficiently manages traffic between AI microservices within the Cake platform, working seamlessly with Envoy, a high-performance proxy, to enforce critical security measures such as mutual TLS (mTLS) for encrypted communication. This ensures that only authorized services can interact and that all communications remain secure. Additionally, Istio’s policy engine provides fine-grained access control by leveraging attributes such as identity and metadata to enforce restrictions. Complementing these tools, OAuth 2.0 enhances user authentication and authorization by integrating with corporate identity providers, delivering a comprehensive security framework.
The Cake platform implements robust, high-level threat mitigations to ensure security and compliance:
Encryption in Transit: All data in transit is strongly encrypted.
Encryption at Rest: Critical security data is protected via encryption at rest using cloud Key Management Services (KMS) and Kubernetes secrets.
Principle of Least Privilege: Permissions are tightly scoped to enable only necessary operations, leveraging Kubernetes Role-Based Access Control (RBAC).
User Authentication: Access to platform services requires authentication enforced through a service mesh and gateway via OIDC authentication, supporting Cloud IAM, LDAP, and Active Directory.
Network Segmentation: Cake is deployed in a dedicated AWS account or GCP project with a new VPC, ensuring deliberate routing to critical organizational resources.
Audit Logging and Non-repudiation: User activity is tracked via cloud provider logs and service mesh sidecars. Cluster changes are managed through pull requests to a client-controlled IaC repository, with component logging directed to secure cloud logging services.
Data Integrity: Privileges to modify data are tightly restricted. Default profile namespaces require explicit permissions for data access, and well-protected data warehouses or cloud-managed databases ensure integrity.
PII Handling: Designed to facilitate proper handling of PII/PHI, the platform restricts user access to scoped namespaces and supports data sanitization in training while allowing full datasets for production inference. Tools for PII best practices can be integrated into workflows.
Code Scanning: Docker images are automatically scanned, and SBOMs are generated for compliance.
Certification: The platform itself is SOC 2 and HIPAA certified, simplifying certification processes for users.
Example architecture for tenant isolation within Cake

Monitoring and Observability
Cake leverages Prometheus as its core monitoring solution. Metrics from all installed open-source applications are written to Prometheus, enabling visualization in dashboards. Cake integrates Prometheus with the Istio service mesh and Envoy gateway to capture detailed request/response metrics for enhanced observability. Prometheus provides a robust time-series database and PromQL query language, enabling efficient scraping, storage, and analysis of system, application, and custom metrics from Kubernetes clusters. Its seamless integration with Kubernetes—using Service Discovery and labels—supports dynamic environments.
Cake extends Prometheus with advanced dashboarding and alerting capabilities, offering real-time insights into AI workloads, model performance, and resource utilization through dynamic dashboards. Customizable, rule-based alerting ensures teams are promptly notified of critical issues such as resource bottlenecks or degraded AI model performance, maintaining system reliability and high availability.
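For instance, a dashboard or script can pull GPU utilization straight from Prometheus's HTTP API. This sketch assumes an in-cluster Prometheus endpoint and DCGM exporter metrics, both of which will vary by deployment:

```python
import requests

# PromQL: average GPU utilization per node over the past 5 minutes.
PROM = "http://prometheus.monitoring.svc:9090"  # assumed in-cluster endpoint
query = "avg by (node) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))"

resp = requests.get(f"{PROM}/api/v1/query", params={"query": query})
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("node", "?"), series["value"][1])
```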
Additionally, Cake includes Retrieval-Augmented Generation (RAG) monitoring tools to capture logs and debugging data, pushing this information to standard cloud solutions like CloudWatch or vendor solutions like Datadog. Cake’s pipeline and workflow engines further process these logs, integrating them into labeling tools such as Label Studio to build datasets for fine-tuning models. They also support evaluation of auto-generated prompts from tools such as DSPy, assessing cost, latency, token usage, and accuracy across models, including fine-tuned LLMs.
Cake monitors data drift and can automatically trigger retraining pipelines based on this metric, ensuring models remain performant and aligned with evolving data distributions.
Data Engineering and Storage
The Cake platform encrypts all data by default, restricting access to specific users or projects. However, it also enables dataset sharing through multiple channels, including CakeFS, a shared POSIX file system for seamless data exchange between notebooks and applications, and support for object storage solutions such as S3.
Cake’s data engineering tools optimize pipelines to deliver clean, reliable, and accessible data for AI model development and deployment. These tools integrate with databases, data warehouses, and data lakes, allowing efficient data extraction, transformation, and centralization, along with privacy preservation. For quick access to data locked in systems such as ticketing platforms or S3, Cake simplifies integration and ensures rapid availability for use cases such as Retrieval-Augmented Generation (RAG).
Cake also transforms raw data into analysis-ready formats, facilitating feature engineering and maintaining consistency and quality for training. It automates end-to-end pipeline orchestration, managing ingestion, preprocessing, and model training workflows while handling dependencies and scheduling. Additionally, Cake provides open-source solutions for vector databases, graph databases, and feature stores, enhancing scalability, automation, and reproducibility. These capabilities reduce engineering overhead and enable faster, more robust AI development.
The Bottom Line
Cake is more than just a platform; it’s a unified AI ecosystem designed to handle the most complex and highly scaled workloads with relative ease. By abstracting the underlying infrastructure while maintaining full control and observability, Cake empowers data scientists and AI and MLOps engineers to focus on innovation, not infrastructure.
From Kubernetes-powered scalability to robust security, integrated monitoring, and seamless data workflows, Cake is the platform that takes AI from concept to production—without compromise.
Frequently asked questions
Do my data scientists need to be Kubernetes experts to use the Cake platform? Not at all. That’s one of the main problems we solve. The platform is designed to abstract away the complexities of the underlying infrastructure. Your data scientists can launch experiments, deploy models, and access the resources they need through a simple, unified interface without ever having to write a Kubernetes manifest or a Slurm submission script. They get to focus on building great models, not on managing infrastructure.
How does deploying Cake in our own environment protect our sensitive data? By deploying directly into your own virtual private cloud (VPC) or on-premise servers, Cake ensures that your proprietary data never has to leave your control. All data is encrypted both while it's moving and while it's stored. This architecture gives you the full power of a managed AI platform without compromising on the security and privacy of your most critical information.
What are 'Recipes' and how do they actually help my team get started? Think of Recipes as pre-built starter kits for common AI projects. Instead of having your team spend weeks configuring tools for a task like advanced document analysis (RAG) or churn prediction, they can use a Recipe. Each one includes the necessary infrastructure components, data pipelines, and example code, all integrated and ready to go. This allows your team to skip the setup and get straight to solving the business problem.
How does Cake help us control our AI spending? We help you manage costs in a few key ways. First, because Cake runs on standard cloud virtual machines, you avoid the significant markups that other managed AI platforms often add—what we call the "AI tax." Second, our platform uses intelligent autoscaling to provide resources like GPUs only when they're needed, scaling down to zero when they're not. This prevents you from paying for expensive hardware that’s sitting idle.
My team uses both on-premise Slurm clusters and the cloud. How does Cake handle this kind of hybrid setup? Cake is built for this exact scenario. We use a tool called SkyPilot to create a bridge between your different computing environments. It allows your team to treat your on-premise Slurm cluster as just another resource, right alongside cloud providers like AWS or GCP. This means they can use a single, consistent workflow to run jobs wherever it makes the most sense, without the headache of managing two separate systems.
Key Takeaways
- Unify your entire AI stack: Cake integrates the best open-source tools for data, training, and inference into a single platform, allowing your team to focus on building models instead of managing complex infrastructure.
- Bridge on-premise and cloud resources: The platform uses Kubernetes for flexible scaling and uniquely integrates with Slurm via SkyPilot, providing one simple workflow to run jobs on the right compute without the operational headache.
- Deploy AI securely without the vendor tax: Cake runs inside your own cloud or on-premise environment for maximum data control and security, all while using standard compute instances to avoid expensive, AI-specific pricing.
About Author
Skyler Thomas
Skyler is Cake's CTO and co-founder. He is an expert in the architecture and design of AI, ML, and big data infrastructure on Kubernetes. He has over 15 years of experience building massively scaled ML systems for Fortune 100 enterprises as a CTO, Chief Architect, and Distinguished Engineer at HPE, MapR, and IBM. He is a frequently requested speaker and has presented at numerous AI and industry conferences, including KubeCon, Scale by the Bay, O'Reilly AI, Strata Data, Strat AI, OOPSLA, and JavaOne.