SkyPilot and Slurm: Bridge HPC and Cloud for AI
Building and managing your AI lifecycle shouldn't be a choice between expensive cloud services and the DIY chaos of open-source tools. That's the philosophy behind Cake AI. We provide a complete, end-to-end environment for everything from data engineering to model training and inference. So, how does it all work? This post pulls back the curtain on our platform architecture. We'll show you how our design choices give you flexibility and control, including how our unique SkyPilot and Slurm integration helps you bridge the gap between on-prem HPC compute and cloud-native Kubernetes workflows.
Holistic AI Lifecycle Management
The Cake platform integrates the entire range of capabilities needed for managing the AI lifecycle, including:
Advanced large language and embedding models
Multi-agent systems
3D parallel training and fine-tuning capabilities
Model monitoring and observability tools
GPU auto-scaling (from zero to thousands of nodes) for both training and inference
Exploratory data analysis (EDA) and AutoML frameworks
Cloud cost monitoring and optimization
PII/PHI anonymization utilities
Built to handle both traditional ML and generative AI workloads, Cake provides centralized management—a “single pane of glass”—to oversee every AI project. Whether you’re fine-tuning access controls, optimizing resource allocation, or managing billing for partner tenants, Cake eliminates complexity while ensuring there’s no added cost burden, or “AI tax,” on Kubernetes, Slurm, or virtual machine (VM) instances.
Deployment Flexibility: VPC and On-Prem
Cake deploys directly into your own virtual private cloud (VPC) or on-premises infrastructure. This ensures no sensitive data ever leaves your environment. With encryption both in transit and at rest, along with robust Kubernetes role-based access controls (RBAC), Cake prioritizes security at every layer.
Every component is authenticated, and platform access is scoped based on user roles, ensuring a least-privilege model. Even the deployment itself adheres to infrastructure-as-code (IaC) principles, where all changes are version-controlled through Git repositories. This gives teams full transparency and control over their infrastructure.
Recipes: Accelerating Common Use Cases
Cake is a unified platform featuring over 100 open-source components, designed to streamline AI development and deployment. Key components include:
Inference engines
Model and data drift detection monitors
Data engineering tools
Vector databases
Fine-tuning frameworks for open-source LLMs
Experiment tracking and model registries
Pipeline orchestration engines
Cloud-based notebooks
To simplify common workflows and MLOps tasks, Cake provides Recipes—pre-assembled infrastructure components, pipelines, and example notebooks. These Recipes accelerate project timelines for use cases such as Advanced RAG, GraphRAG, learning agents, recommendation engines, churn modeling, image segmentation, dynamic pricing, and many more.
Each Recipe includes a fully integrated component stack to support seamless development, testing, and deployment.
Example stack for GraphRAG on the Cake platform

Kubernetes at the Core
Cake is built on Kubernetes, providing a scalable, secure, and resource-efficient foundation capable of managing demanding AI workloads. Cake lets you exploit the advantages of Kubernetes without the complexity: data scientists can deploy models, run experiments, and access resources without requiring Kubernetes expertise.
Kubernetes gives Cake dynamic scaling, workload isolation, and seamless integration with AI tools, creating a robust environment for both development and deployment. Kubernetes also provides self-healing capabilities, allowing Cake to minimize downtime by automatically restarting failed pods and performing rolling updates with minimal disruption. These features ensure reliability, scalability, and cost efficiency in AI operations.
Kubernetes also provides numerous cost and flexibility advantages. For example, Kubernetes makes Cake cloud-agnostic, supporting multi-cloud and on-prem deployments; this versatility would be almost unattainable without Kubernetes. Additionally, Kubernetes delivers substantial cost savings compared to common alternatives because its lightweight container architecture lets workloads share nodes efficiently. Finally, unlike platforms such as SageMaker, Vertex AI, or Databricks, Cake runs on standard cloud VM instances without incurring AI-specific costs, avoiding the "AI tax" (as of this writing, SageMaker instances cost on average 40% more than the equivalent EC2 instances that EKS can use).
Cake Deployment Architecture

Beyond Kubernetes: SkyPilot and Slurm
Cake's SkyPilot integration enhances the platform by enabling seamless execution of non-Kubernetes workloads. SkyPilot facilitates cost-efficient, multi-cloud deployments through a streamlined interface that optimizes resource utilization and automates spot-instance management. Supporting over a dozen cloud and on-prem configurations, it dynamically selects the most cost-effective, high-performing nodes that can satisfy your workload's requirements, whether on AWS, GCP, Azure, Lambda Labs, or others.
Features such as built-in fault tolerance, dynamic scaling, and AI/ML workflow support simplify tasks like distributed training and hyperparameter tuning. SkyPilot also ensures resilience with workload checkpointing and recovery while maintaining reproducibility through consistent configurations, making it an indispensable tool for managing scalable, reliable, and cost-effective AI systems.
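To make that concrete, here is a minimal sketch of what launching a GPU job through SkyPilot's Python API can look like. The training script, accelerator type, and cluster name are placeholders, not prescriptions:

```python
import sky

# Define the work: setup runs once per node, run is the job's entrypoint.
# 'train.py' is a placeholder for your own training script.
task = sky.Task(
    setup="pip install torch",
    run="python train.py",
)
task.set_resources(sky.Resources(accelerators="A100:8", use_spot=True))

# SkyPilot scans the clouds you have enabled and provisions the cheapest
# offering that satisfies the resource request.
sky.launch(task, cluster_name="train-cluster")
```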
Beyond SkyPilot, Cake supports Slurm for deployments in non-cloud environments where data locality or on-premise cost considerations are paramount. For example, Slurm is a powerful option if you need to train models in multiple hospitals and combine them using a federated learning framework (e.g., Flower), because patient data cannot be egressed. This flexibility is crucial for these kinds of specialized use cases.
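As a rough illustration of that pattern, here is a minimal Flower client sketch, assuming each hospital runs one as a local Slurm job. The server address is hypothetical and the weights are toy stand-ins; only model updates, never raw patient data, leave the site:

```python
import flwr as fl
import numpy as np

class HospitalClient(fl.client.NumPyClient):
    """Runs inside one hospital; raw patient data never leaves the site."""

    def __init__(self):
        self.weights = [np.zeros((10, 2))]  # toy stand-in for model weights

    def get_parameters(self, config):
        return self.weights

    def fit(self, parameters, config):
        self.weights = parameters
        # ... local training on on-prem patient data would happen here ...
        return self.weights, 100, {}

    def evaluate(self, parameters, config):
        loss, num_examples = 0.0, 100  # placeholder local evaluation
        return loss, num_examples, {}

# Only weight updates travel to the aggregation server (address hypothetical).
fl.client.start_numpy_client(
    server_address="aggregator.example.com:8080", client=HospitalClient()
)
```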
Cake’s platform can also be configured for seamless access from developer or data scientist laptops, enabling IDEs like VS Code or PyCharm to interact remotely with cloud-based Jupyter, RStudio, or VS Code notebooks. Local testing of Ray training or serving code further enhances the development workflow.
Understanding the landscape: Slurm vs. Kubernetes for AI
When it comes to orchestrating large-scale AI workloads, two names often come up: Slurm and Kubernetes. While both are powerful, they were designed with different philosophies and excel in different areas. Understanding their core strengths is key to building an effective AI infrastructure. For many teams, the choice isn't about picking one over the other, but about finding a way to leverage the best of both worlds without creating unnecessary complexity. This is where a managed platform can make all the difference, by abstracting away the underlying infrastructure so your team can focus on building models.
The case for Slurm: Gang scheduling and dedicated resources
Slurm is a long-standing champion in the world of high-performance computing (HPC). Its biggest strength for AI is something called "gang scheduling." Think of it as an all-or-nothing resource allocation. When you start a massive training job that needs a specific number of GPUs, Slurm ensures all of those resources are secured and dedicated to your job from start to finish. This guarantee is critical for long, expensive training runs where interruptions could mean starting over. It provides the stability and predictability that research-intensive AI work demands, ensuring your job has exclusive access to the hardware it needs.
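Here is an illustrative all-or-nothing request, submitted from Python for consistency with the other sketches in this post; the job name, node counts, and script are hypothetical:

```python
import subprocess

# Gang scheduling in practice: Slurm will not start this job until all
# 4 nodes (8 GPUs each) are free together, then holds them exclusively
# for the job until it finishes.
script = """\
#!/bin/bash
#SBATCH --job-name=llm-pretrain
#SBATCH --nodes=4
#SBATCH --gres=gpu:8
#SBATCH --exclusive
srun python train.py
"""

with open("train.sbatch", "w") as f:
    f.write(script)

subprocess.run(["sbatch", "train.sbatch"], check=True)
```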
The case for Kubernetes: Autoscaling and flexibility
Kubernetes, on the other hand, shines in dynamic, production-focused environments. Its standout feature is its ability to autoscale resources. It can automatically add or remove compute resources like GPUs based on real-time demand, which is incredibly cost-effective. You only pay for what you use. Beyond cost savings, Kubernetes is exceptionally flexible. It can manage a diverse range of tasks—from data processing and model training to model serving—all on a single, unified platform. This versatility makes it a great foundation for building end-to-end AI applications that need to adapt quickly.
The scale of modern AI infrastructure
Here’s the reality: neither Slurm nor Kubernetes was originally built with the unique demands of modern AI in mind. Slurm comes from the world of HPC, and Kubernetes from web-scale application management. While both have been adapted for AI, they each have trade-offs. The ultimate goal for any AI team should be to build and deploy models, not to become experts in managing complex, distributed infrastructure. The ideal platform abstracts this choice away, allowing data scientists to access the right resources for the job without getting bogged down in the operational details of the underlying scheduler.
The challenge of building a hybrid AI infrastructure
Given the distinct advantages of both systems, many organizations try to create a hybrid infrastructure that combines Slurm and Kubernetes. The idea is to use Slurm for heavy-duty training and Kubernetes for everything else. In practice, this is much harder than it sounds. A common approach is to run Slurm *inside* a Kubernetes cluster, but this often leads to significant inefficiencies. For example, this setup can force you to reserve entire compute nodes for Slurm, meaning those expensive GPU resources sit idle when no Slurm jobs are running. This creates a clunky, resource-intensive system that defeats the purpose of a flexible, hybrid environment.
How SkyPilot bridges the gap between Slurm and the cloud
This is precisely the problem that Cake’s integration with SkyPilot solves. SkyPilot acts as a smart abstraction layer that sits on top of your compute resources, whether they are in the cloud, on-prem, or in a Slurm cluster. It provides a single, simple interface for launching jobs, letting you tap into the power of different environments without the operational headache. Instead of wrestling with two separate systems, your team can use one unified tool to run workloads wherever it makes the most sense, whether that’s on a Kubernetes cluster, a fleet of cloud VMs, or an on-prem Slurm cluster.
How the SkyPilot and Slurm integration works
The way SkyPilot integrates with Slurm is elegant in its simplicity. It essentially treats your on-premise Slurm cluster as just another "cloud" provider, right alongside AWS, GCP, and Azure. This means your data scientists can submit a job to the Slurm cluster using the exact same SkyPilot command or configuration file they would use to launch it on a cloud instance. There’s no need to learn Slurm-specific commands or write custom submission scripts. SkyPilot handles the translation, making your on-prem hardware a seamless extension of your cloud resources.
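Here is a hedged sketch of what that looks like with SkyPilot's Python API. The script and cluster names are placeholders, and exactly how the Slurm backend is selected depends on your SkyPilot version and configuration:

```python
import sky

# One task definition -- no Slurm-specific submission scripts.
task = sky.Task(run="python finetune.py")  # script name is a placeholder
task.set_resources(sky.Resources(accelerators="A100:4"))

# Launch on whichever cloud satisfies the request ...
sky.launch(task, cluster_name="finetune-cloud")

# ... or pin the very same task to a Kubernetes cluster.
task.set_resources(sky.Resources(cloud=sky.Kubernetes(), accelerators="A100:4"))
sky.launch(task, cluster_name="finetune-k8s")

# With the Slurm integration enabled, the on-prem cluster is addressed the
# same way, as just another provider in the resource spec, so the task
# definition itself never changes.
```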
Key SkyPilot features for Slurm users
This integration offers some powerful advantages. First, it provides a unified access point. If your organization has multiple Slurm clusters, your team can access all of them through a single, consistent interface without juggling different logins or scripts. Second, it makes your workloads portable. The same SkyPilot YAML file that defines your job's requirements can be used to run that job on your Slurm cluster, a Kubernetes pod, or a standard cloud VM. This removes friction and allows your team to move workloads to the most appropriate environment with minimal effort.
Current limitations to be aware of
To give you a complete picture, it's important to know about a couple of current limitations of the SkyPilot and Slurm integration. As of now, jobs launched on Slurm via SkyPilot do not automatically stop when they become idle, a feature that is available for cloud instances. Additionally, the integration does not yet support custom container images (e.g., Docker images). These are important considerations to keep in mind as you plan your workloads, but for many use cases, the benefits of a unified and portable infrastructure provide a massive advantage.
Infrastructure as Code (IaC) and GitOps
Cake leverages Infrastructure as Code (IaC) principles, integrating Terraform, Crossplane, and ArgoCD to deliver a scalable and consistent platform for managing AI infrastructure. All changes are version-controlled, auditable, and reviewable before deployment.
Terraform handles provisioning and management of core cloud resources, including GPU-enabled nodes and storage, forming the backbone of Cake’s infrastructure. ArgoCD handles Kubernetes application deployment using the same GitOps principles as Terraform, installing the desired Helm charts and Kustomize manifests onto the platform. Finally, Crossplane enables Kubernetes to dynamically provision additional cloud resources as ArgoCD adds or removes open-source applications from the running platform.
This combination of IaC software provides declarative, versioned configurations for AI applications, enhancing auditability, collaboration, and disaster recovery. The approach minimizes manual errors, optimizes resource utilization, and accelerates deployment across the AI lifecycle, from data processing to model training and deployment.
Scaling and Cost Optimization
Cake leverages advanced Kubernetes autoscalers such as Karpenter to dynamically scale training and inference clusters from zero to thousands of nodes, optimizing for specific node types. Need fast GPUs for fine-tuning and cost-effective GPUs for inference? Cake’s autoscalers ensure resources are allocated efficiently.
By continuously monitoring cloud resource usage and storing metrics in Prometheus, Cake provides detailed insights into workload costs. Resource Quotas and Limits help you prevent resource overuse, mitigating unexpected expenses.
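As an example of the kind of guardrail involved, here is a sketch that caps a team namespace's GPU requests using the Kubernetes Python client; the namespace, quota name, and limit are hypothetical:

```python
from kubernetes import client, config

config.load_kube_config()

# Cap the 'team-a' namespace at 8 requested GPUs to bound spend.
quota = client.V1ResourceQuota(
    metadata=client.V1ObjectMeta(name="team-a-gpu-quota"),
    spec=client.V1ResourceQuotaSpec(hard={"requests.nvidia.com/gpu": "8"}),
)
client.CoreV1Api().create_namespaced_resource_quota(
    namespace="team-a", body=quota
)
```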
For distributed computing, Cake currently integrates Ray and Spark. Ray supports low-latency, scalable AI workloads such as distributed training, hyperparameter tuning, and real-time inference, while Spark excels at big data preprocessing, ETL, and streaming tasks with Spark SQL and big data ecosystem integration. These tools together enable high-performance, scalable, fault-tolerant workflows, streamlining data processing and model deployment.
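Here is a toy Ray sketch of the fan-out pattern this enables; the work inside each task is a stand-in for real GPU inference or training:

```python
import ray

ray.init()  # connects to the cluster when run inside one; local otherwise

@ray.remote(num_gpus=1)  # each task claims a GPU (drop num_gpus on a laptop)
def embed(batch):
    return [hash(x) % 1000 for x in batch]  # stand-in for real GPU work

batches = [["a", "b"], ["c", "d"], ["e", "f"]]
results = ray.get([embed.remote(b) for b in batches])
print(results)
```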
Cake also supports a range of inference engines and model servers, including vLLM, Triton with TensorRT, Ray Serve, and KServe ModelMesh. These leverage Kubernetes autoscaling to optimize fractional GPU usage, apply DeepSpeed 3D parallelism for cost-effective deployment on cheaper GPUs, handle complex inference graphs, and enable advanced deployment strategies such as canary deployments and A/B testing.
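As a small illustration, here is what generating text through vLLM's offline Python API can look like; the model checkpoint is just an example:

```python
from vllm import LLM, SamplingParams

# Model name is illustrative; any Hugging Face-compatible checkpoint works.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")
params = SamplingParams(temperature=0.2, max_tokens=64)

outputs = llm.generate(["Explain canary deployments in one sentence."], params)
print(outputs[0].outputs[0].text)
```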
State-of-the-Art Security Architecture
The Cake platform is designed to meet critical security objectives that align with the interests of all stakeholders:
Maximize productivity by implementing role-based access controls in Kubeflow, ensuring users have the least privilege necessary to access proprietary data.
Minimize risk from unauthorized or accidental changes by requiring appropriate review and testing processes based on the potential impact of changes on production systems.
Enhance observability by providing audit logs and other vital information necessary for effective incident response, non-repudiation, and forensic analysis.
Ensure availability and stability at a level that matches the business criticality of the services provided, maintaining reliable and robust performance.
Cake’s integration of an Envoy gateway, an Istio service mesh, and OAuth 2.0 significantly strengthens Kubernetes’ role-based access control (RBAC) security by enabling robust service-to-service authentication, authorization, and encryption.
Istio, the service mesh, efficiently manages traffic between AI microservices within the Cake platform, working seamlessly with Envoy, a high-performance proxy, to enforce critical security measures such as mutual TLS (mTLS) for encrypted communication. This ensures that only authorized services can interact and that all communications remain secure. Additionally, Istio’s policy engine provides fine-grained access control by leveraging attributes such as identity and metadata to enforce restrictions. Complementing these tools, OAuth 2.0 enhances user authentication and authorization by integrating with corporate identity providers, delivering a comprehensive security framework.
The Cake platform implements robust, high-level threat mitigations to ensure security and compliance:
Encryption in Transit: All data in transit is strongly encrypted.
Encryption at Rest: Critical security data is protected via encryption at rest using cloud Key Management Services (KMS) and Kubernetes secrets.
Principle of Least Privilege: Permissions are tightly scoped to enable only necessary operations, leveraging Kubernetes Role-Based Access Control (RBAC).
User Authentication: Access to platform services requires authentication enforced through a service mesh and gateway via OIDC authentication, supporting Cloud IAM, LDAP, and Active Directory.
Network Segmentation: Cake is deployed in a dedicated AWS account or GCP project with a new VPC, ensuring deliberate routing to critical organizational resources.
Audit Logging and Non-repudiation: User activity is tracked via cloud provider logs and service mesh sidecars. Cluster changes are managed through pull requests to a client-controlled IaC repository, with component logging directed to secure cloud logging services.
Data Integrity: Privileges to modify data are tightly restricted. Default profile namespaces require explicit permissions for data access, and well-protected data warehouses or cloud-managed databases ensure integrity.
PII Handling: Designed to facilitate proper handling of PII/PHI, the platform restricts user access to scoped namespaces and supports data sanitization in training while allowing full datasets for production inference. Tools for PII best practices can be integrated into workflows.
Code Scanning: Docker images are automatically scanned, and SBOMs are generated for compliance.
Certification: The platform itself is SOC 2 and HIPAA certified, simplifying certification processes for users.
Example architecture for tenant isolation within Cake

Monitoring and Observability
Cake leverages Prometheus as its core monitoring solution. Metrics from all installed open-source applications are written to Prometheus, enabling visualization in dashboards. Cake integrates Prometheus with the Istio service mesh and Envoy gateway to capture detailed request/response metrics for enhanced observability. Prometheus provides a robust time-series database and PromQL query language, enabling efficient scraping, storage, and analysis of system, application, and custom metrics from Kubernetes clusters. Its seamless integration with Kubernetes—using Service Discovery and labels—supports dynamic environments.
Cake extends Prometheus with advanced dashboarding and alerting capabilities, offering real-time insights into AI workloads, model performance, and resource utilization through dynamic dashboards. Customizable, rule-based alerting ensures teams are promptly notified of critical issues such as resource bottlenecks or degraded AI model performance, maintaining system reliability and high availability.
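For instance, a dashboard or script can pull GPU utilization straight from Prometheus's HTTP API. This sketch assumes an in-cluster Prometheus endpoint and DCGM exporter metrics, both of which will vary by deployment:

```python
import requests

# PromQL: average GPU utilization per node over the past 5 minutes.
PROM = "http://prometheus.monitoring.svc:9090"  # assumed in-cluster endpoint
query = "avg by (node) (avg_over_time(DCGM_FI_DEV_GPU_UTIL[5m]))"

resp = requests.get(f"{PROM}/api/v1/query", params={"query": query})
for series in resp.json()["data"]["result"]:
    print(series["metric"].get("node", "?"), series["value"][1])
```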
Additionally, Cake includes Retrieval-Augmented Generation (RAG) monitoring tools to capture logs and debugging data, pushing this information to standard cloud solutions like CloudWatch or vendor solutions like Datadog. Cake’s pipeline and workflow engines further process these logs, integrating them into labeling tools such as Label Studio to build datasets for fine-tuning models. They also support evaluation of auto-generated prompts from tools such as DSPy, assessing cost, latency, token usage, and accuracy across models, including fine-tuned LLMs.
Cake monitors data drift and can automatically trigger retraining pipelines based on this metric, ensuring models remain performant and aligned with evolving data distributions.
Data Engineering and Storage
The Cake platform encrypts all data by default, restricting access to specific users or projects. However, it also enables dataset sharing through multiple channels, including CakeFS, a shared POSIX file system for seamless data exchange between notebooks and applications, and support for object storage solutions such as S3.
Cake’s data engineering tools optimize pipelines to deliver clean, reliable, and accessible data for AI model development and deployment. These tools integrate with databases, data warehouses, and data lakes, allowing efficient data extraction, transformation, and centralization, along with privacy preservation. For quick access to data locked in systems such as ticketing platforms or S3, Cake simplifies integration and ensures rapid availability for use cases such as Retrieval-Augmented Generation (RAG).
Cake also transforms raw data into analysis-ready formats, facilitating feature engineering and maintaining consistency and quality for training. It automates end-to-end pipeline orchestration, managing ingestion, preprocessing, and model training workflows while handling dependencies and scheduling. Additionally, Cake provides open-source solutions for vector databases, graph databases, and feature stores, enhancing scalability, automation, and reproducibility. These capabilities reduce engineering overhead and enable faster, more robust AI development.
The Bottom Line
Cake is more than just a platform; it’s a unified AI ecosystem designed to handle the most complex and highly scaled workloads with relative ease. By abstracting the underlying infrastructure while maintaining full control and observability, Cake empowers data scientists and AI and MLOps engineers to focus on innovation, not infrastructure.
From Kubernetes-powered scalability to robust security, integrated monitoring, and seamless data workflows, Cake is the platform that takes AI from concept to production—without compromise.
Frequently asked questions
Do my data scientists need to be Kubernetes experts to use the Cake platform? Not at all. That’s one of the main problems we solve. The platform is designed to abstract away the complexities of the underlying infrastructure. Your data scientists can launch experiments, deploy models, and access the resources they need through a simple, unified interface without ever having to write a Kubernetes manifest or a Slurm submission script. They get to focus on building great models, not on managing infrastructure.
How does deploying Cake in our own environment protect our sensitive data? By deploying directly into your own virtual private cloud (VPC) or on-premise servers, Cake ensures that your proprietary data never has to leave your control. All data is encrypted both while it's moving and while it's stored. This architecture gives you the full power of a managed AI platform without compromising on the security and privacy of your most critical information.
What are 'Recipes' and how do they actually help my team get started? Think of Recipes as pre-built starter kits for common AI projects. Instead of having your team spend weeks configuring tools for a task like advanced document analysis (RAG) or churn prediction, they can use a Recipe. Each one includes the necessary infrastructure components, data pipelines, and example code, all integrated and ready to go. This allows your team to skip the setup and get straight to solving the business problem.
How does Cake help us control our AI spending? We help you manage costs in a few key ways. First, because Cake runs on standard cloud virtual machines, you avoid the significant markups that other managed AI platforms often add—what we call the "AI tax." Second, our platform uses intelligent autoscaling to provide resources like GPUs only when they're needed, scaling down to zero when they're not. This prevents you from paying for expensive hardware that’s sitting idle.
My team uses both on-premise Slurm clusters and the cloud. How does Cake handle this kind of hybrid setup? Cake is built for this exact scenario. We use a tool called SkyPilot to create a bridge between your different computing environments. It allows your team to treat your on-premise Slurm cluster as just another resource, right alongside cloud providers like AWS or GCP. This means they can use a single, consistent workflow to run jobs wherever it makes the most sense, without the headache of managing two separate systems.
Key Takeaways
- Unify your entire AI stack: Cake integrates the best open-source tools for data, training, and inference into a single platform, allowing your team to focus on building models instead of managing complex infrastructure.
- Bridge on-premise and cloud resources: The platform uses Kubernetes for flexible scaling and uniquely integrates with Slurm via SkyPilot, providing one simple workflow to run jobs on the right compute without the operational headache.
- Deploy AI securely without the vendor tax: Cake runs inside your own cloud or on-premise environment for maximum data control and security, all while using standard compute instances to avoid expensive, AI-specific pricing.
About Author
Skyler Thomas
Skyler is Cake's CTO and co-founder. He is an expert in the architecture and design of AI, ML, and big data infrastructure on Kubernetes. He has over 15 years of experience building massively scaled ML systems for Fortune 100 enterprises as a CTO, Chief Architect, and Distinguished Engineer at HPE, MapR, and IBM. He is a frequently requested speaker and has presented at numerous AI and industry conferences, including KubeCon, Scale by the Bay, O'Reilly AI, Strata Data, Strat AI, OOPSLA, and JavaOne.