StackSimplify | DevOps & Cloud Education by Kalyan Reddy

The Complete MLOps Platform: 25 Posts, 8 Layers, One Architecture

Thu, 07 May 2026 00:00:00 +0000

25 posts. One platform. Every tool a DevOps engineer already knows.

When this series started in February, MLOps felt like a separate discipline. Specialized tools. Unfamiliar workflows. A whole new vocabulary that seemed disconnected from everything you already knew.

25 posts later, here is what actually happened: every single pattern mapped back to something you have been doing for years.

The Complete Architecture

Eight layers. Each solves a specific production problem.

MLOps Maturity Model: From Notebooks to Platform in 5 Levels

Tue, 05 May 2026 00:00:00 +0000

Level 0: Jupyter notebook in production. Level 4: Fully automated ML lifecycle.

Most teams think they are somewhere in the middle. Most teams are wrong.

Here is the MLOps Maturity Model. Five levels, from chaos to platform.

The Five Levels

Level	Name	What It Looks Like
0	Manual	Notebooks copied to prod. No versioning. Single person dependency.
1	Managed	Model registry, basic monitoring, manual retraining with a process.
2	Automated	CI/CD pipelines, automated retraining triggers, quality gates.
3	Governed	Feature stores, A/B testing, drift-triggered retraining, RBAC, audit trails.
4	Optimized	Multi-model platform, GPU scheduling, cost optimization, self-healing.

Level 0: Manual

Notebooks copied to production servers. Models deployed by the person who trained them. No versioning. No monitoring. No rollback plan.

Multi-Model Serving on Kubernetes: 50 Models, One Cluster

Wed, 22 Apr 2026 00:00:00 +0000

50 models. 10 active. 40 at zero. One cluster.

That is the reality of a mature ML platform. Not one model per team. Not one namespace per endpoint. Dozens of models sharing infrastructure, scaling independently, and costing almost nothing when idle.

Most teams never get here. They get stuck at the single-model trap.

The Single-Model Trap

Team A deploys their fraud model. Gets its own namespace, its own Istio gateway, its own monitoring stack. Works great.

ML Security on Kubernetes: 4 Layers Protecting Your Models

Sun, 19 Apr 2026 00:00:00 +0000

Your model endpoint has no auth. Anyone with the URL gets predictions.

That is not a hypothetical. It is the default on most KServe deployments. Deploy a model, get an endpoint, and it is wide open. No token. No identity check. No network restriction.

ML systems have a unique attack surface: training data, model artifacts, feature stores, and inference endpoints. Each one is a target.

GPU Scheduling on Kubernetes: MIG, Time-Slicing, and Node Pools

Fri, 17 Apr 2026 00:00:00 +0000

One NVIDIA A100 GPU costs $3 per hour on AWS. Your inference pod uses 12% of it. The other 88% sits idle, billed, and wasted.

Kubernetes schedules GPUs as whole devices by default. One pod gets one GPU. No sharing. No slicing. Massive waste for inference workloads.

The Problem: One GPU, One Pod

A fraud detection model needs 2GB of GPU memory and runs a few requests per second. The node has an A100 with 40GB. Kubernetes assigns the whole GPU to that one pod.
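
The waste in this example can be put into numbers. The following is a minimal sketch using the figures quoted above ($3 per hour, 12% utilization, a 2GB model on a 40GB A100). The function names are illustrative, and the packing estimate counts GPU memory only; it ignores compute contention and the fixed profile sizes that MIG imposes.

```python
def idle_cost_per_hour(hourly_rate: float, utilization: float) -> float:
    """Dollars per hour spent on GPU capacity that does no work."""
    return hourly_rate * (1.0 - utilization)


def models_that_fit(gpu_memory_gb: int, model_memory_gb: int) -> int:
    """Upper bound on co-located models, counting GPU memory only."""
    return gpu_memory_gb // model_memory_gb


wasted = idle_cost_per_hour(hourly_rate=3.0, utilization=0.12)
print(f"Idle spend: ${wasted:.2f}/hour")
print(f"Models that fit by memory alone: {models_that_fit(40, 2)}")
```

Under these assumptions most of the hourly bill buys idle capacity, which is the gap that GPU sharing is meant to close.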

Batch vs Real-Time ML Inference: 90% of Predictions Can Be Batch

Wed, 15 Apr 2026 00:00:00 +0000

Your model runs in real-time. 90% of your predictions do not need to.

That is the most expensive assumption in ML infrastructure. A recommendation engine that refreshes daily does not need always-on pods. A credit risk score computed once at application time does not need a replica running at 3 AM.

Most teams default to real-time because that is how their first model shipped. Every model after inherits the same pattern. And the same bill.

5 Levels of ML Model Deployment on Kubernetes

Tue, 14 Apr 2026 00:00:00 +0000

You deploy containers to Kubernetes every day. But how do you deploy ML models?

There are 5 levels. Each adds production capabilities. Here’s the progression.

The 5 Levels

Level	Pattern	DevOps Equivalent	When to Use
L1	Baked Image	Static binary in container	Learning, simple models
L2	MLflow Dynamic	Config from external store	Versioned, no rebuild
L3	KServe Predictor	Deployment + HPA + Ingress	Scalable, zero downtime
L4	KServe Transformer	Sidecar pattern	Modular, independent scaling
L5	KServe Explainer	Audit logging	Compliance, GDPR

Level 1: Baked Image

Model baked into the Docker image at build time. Simple: docker build, kubectl apply, done.

5 Questions to Ask Before Every ML Model Deployment

Tue, 14 Apr 2026 00:00:00 +0000

A data scientist hands you a model.pkl and says “deploy this.”

What do you ask?

Most engineers jump straight to containers and endpoints. But the questions that save you at 2 AM are the ones you ask before deployment, not during an incident.

The Checklist

#	Question	Why It Matters
1	What input will break it?	Models return garbage confidently on bad input
2	What’s the rollback plan?	“Redeploy the old one” is not a plan
3	How do we know it’s broken?	ML models fail silently with HTTP 200
4	What versions are pinned?	scikit-learn 1.3 vs 1.5 = model won’t load
5	Who gets paged at 2 AM?	Define ownership before production

1. What Input Will Break It?

Missing fields? Nulls? Negative values where the model expects positive?
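
One way to answer this question before deployment is to validate every payload before it reaches the model. The sketch below is an assumption-laden illustration: the field names and ranges in `SCHEMA` are hypothetical, and a real service would derive them from the training data.

```python
import math

# Hypothetical schema for illustration: field name -> (min, max) allowed range.
SCHEMA = {"amount": (0.0, 1_000_000.0), "account_age_days": (0.0, 36_500.0)}


def validate(payload: dict) -> list[str]:
    """Return a list of problems; an empty list means the payload is safe to score."""
    problems = []
    for field, (low, high) in SCHEMA.items():
        value = payload.get(field)
        if value is None:
            problems.append(f"{field}: missing or null")
        elif isinstance(value, bool) or not isinstance(value, (int, float)):
            problems.append(f"{field}: not a number")
        elif math.isnan(value) or not low <= value <= high:
            problems.append(f"{field}: out of range [{low}, {high}]")
    return problems


print(validate({"amount": -5, "account_age_days": None}))
```

Rejecting a bad request with an explicit error is usually safer than letting the model return a confident score for input it was never trained on.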

A/B Testing for ML Models: When Offline Metrics Lie

Tue, 14 Apr 2026 00:00:00 +0000

You retrained the model. Accuracy went up 2% on the test set. You deployed it. Revenue dropped 5%.

What happened? Offline metrics lie. A model that scores better on historical data can score worse on real users.

Canary vs A/B Testing

Approach	Question It Answers	Traffic Split
Canary	“Does it break anything?”	10-20% to new model
A/B Testing	“Does it actually improve outcomes?”	50/50 to both models

You need both. Canary first, then A/B.
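
For an A/B test to measure outcomes, each user should keep seeing the same model for the length of the experiment. A common way to get that is hash-based assignment. This is a sketch only: the experiment name and user IDs are made up, and the post does not say how the author assigns users.

```python
import hashlib


def assign_variant(user_id: str, experiment: str = "fraud-model-ab") -> str:
    """Deterministically map a user to 'A' or 'B' with a 50/50 split."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100  # stable bucket in 0..99
    return "A" if bucket < 50 else "B"


counts = {"A": 0, "B": 0}
for i in range(10_000):
    counts[assign_variant(f"user-{i}")] += 1
print(counts)
```

Including the experiment name in the hash means a later experiment reshuffles users instead of reusing the same two groups.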

Canary Deployments for ML Models with KServe and Istio

Tue, 14 Apr 2026 00:00:00 +0000

You do canary deployments for APIs every day. Why not for ML models?

New model ready. Looks good in testing. Deploy to production. Hope it works. It doesn’t. Rollback takes 5 minutes. Five minutes of garbage predictions. Damage done.

How It Works

Role	Traffic	Description
Champion (80%)	Production traffic	Current model, proven, stable
Canary (20%)	Test traffic	New version, running alongside

Both run simultaneously. Same endpoint. Istio handles the traffic split.
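
The split itself is plain weighted routing. The simulation below is only a sketch of the idea, not of Istio's implementation; the backend names are illustrative and the weights come from the table above.

```python
import random

# Weights from the table above; the backend names are illustrative.
WEIGHTS = {"champion": 80, "canary": 20}


def pick_backend(rng: random.Random) -> str:
    """Choose a backend for one request in proportion to its weight."""
    names = list(WEIGHTS)
    return rng.choices(names, weights=[WEIGHTS[n] for n in names], k=1)[0]


rng = random.Random(42)  # fixed seed so the simulation is repeatable
hits = {"champion": 0, "canary": 0}
for _ in range(10_000):
    hits[pick_backend(rng)] += 1
print(hits)
```

Because only a fraction of requests reach the new version, a bad model affects a fraction of predictions while the champion keeps serving the rest.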

CI/CD for ML: Same GitHub Actions, Different Artifact

Tue, 14 Apr 2026 00:00:00 +0000

Your CI/CD pipeline deploys code. Ours deploys models. Same tools.

GitHub Actions. ArgoCD. Docker. DVC. MLflow. Same stack you already run. The only difference is what triggers the pipeline and what gets deployed.

Code pipeline: git push > build > test > deploy
ML pipeline: data change > retrain > evaluate > deploy

The 7-Job ML Pipeline

Job	What It Does	Failure Action
0. Preflight	7 infra checks in 5 min (MLflow up? MinIO? DVC?)	Fail fast
1. Data + Features	DVC pulls dataset, feature engineering runs	Stop on schema error
2. Train + Gate	Train candidate, compare vs champion	If candidate loses, skip Jobs 3-6
3. Export	Get champion model URI from MLflow	Stop on registry error
4. Build	Build transformer container	Stop on build error
5. GitOps	Patch KServe YAML, push to git	ArgoCD watches repo
6. Verify	ArgoCD syncs, health check, 3 smoke tests	Rollback on failure

Jobs 3 and 4 run in parallel.
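
The gate in Job 2 decides whether the rest of the pipeline runs at all. Here is a minimal sketch of that control flow, assuming a single score where higher is better and where a tie counts as a loss for the candidate. The job labels and the scoring rule are illustrative, not the author's workflow file.

```python
def plan_jobs(candidate_score: float, champion_score: float) -> list[str]:
    """Return the jobs that would run, in order, for one pipeline execution."""
    jobs = ["0-preflight", "1-data-features", "2-train-gate"]
    if candidate_score <= champion_score:
        return jobs  # candidate lost the gate: Jobs 3-6 are skipped
    # Jobs 3 and 4 do not depend on each other, so a runner can start both at once.
    jobs += ["3-export", "4-build", "5-gitops", "6-verify"]
    return jobs


print(plan_jobs(candidate_score=0.90, champion_score=0.92))
print(plan_jobs(candidate_score=0.95, champion_score=0.92))
```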

Data Drift Detection: When Your Model Stops Being Right

Tue, 14 Apr 2026 00:00:00 +0000

Your model was trained on last year’s data. The world has moved on. Your model has not.

Your model can return predictions with perfect latency, zero errors, 200 OK on every request. And every single prediction can be wrong.

Operational monitoring tells you the model is running. Statistical monitoring tells you the model is still right.

The Three Types of Drift

Type	What Changed	Example
Data Drift	The inputs changed	Model trained on ages 25-45, now seeing ages 18-22
Concept Drift	The relationships changed	High frequency used to mean fraud, now means power user
Prediction Drift	The outputs changed	Fraud rate prediction jumped from 5% to 15%
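
Data drift like the age example in the table can be measured by comparing the training distribution with live traffic. The sketch below uses the Population Stability Index (PSI), one common choice; the post does not say which statistic the author uses, and the bin edges and synthetic samples are assumptions.

```python
import math


def psi(expected: list[float], actual: list[float], edges: list[float]) -> float:
    """Population Stability Index between two samples over shared bin edges."""

    def shares(values: list[float]) -> list[float]:
        counts = [0] * (len(edges) + 1)
        for v in values:
            counts[sum(v >= e for e in edges)] += 1  # index of the bin holding v
        # Small floor avoids log(0) when a bin is empty.
        return [max(c / len(values), 1e-6) for c in counts]

    e, a = shares(expected), shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


training_ages = [25 + (i % 21) for i in range(1000)]  # ages 25-45
live_ages = [18 + (i % 5) for i in range(1000)]       # ages 18-22
edges = [25, 35, 45]

print(f"PSI, same data: {psi(training_ages, training_ages, edges):.3f}")
print(f"PSI, shifted:   {psi(training_ages, live_ages, edges):.3f}")
```

A widely used rule of thumb treats a PSI above roughly 0.2 as a shift worth investigating, though the right threshold depends on the feature and the model.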

The DevOps Parallel

Infrastructure monitoring: Is the server healthy?
Application monitoring: Is the app returning correct responses?
Data monitoring: Is the model still seeing the right inputs?

You wouldn’t skip application monitoring just because the server is healthy. Don’t skip data monitoring just because the model is running.

DevOps Thinking Applied to MLOps: 5 Essential Tools

Tue, 14 Apr 2026 00:00:00 +0000

If you’re a DevOps engineer and a data scientist has ever handed you a model.pkl and said “deploy this”, you know the feeling.

Where did this come from? What data trained it? Which version is this? How do I scale it?

Here’s what I’ve learned after months of building MLOps pipelines: these aren’t new problems. We’ve already solved them in DevOps. The tools are different, but the thinking is identical.

DVC: Git for Your ML Training Data

Tue, 14 Apr 2026 00:00:00 +0000

You version code with Git. What about your model training data?

If you’ve ever asked “Which dataset trained this model?” or “Can we reproduce last month’s model exactly?”, you need DVC.

What DVC Solves

Problem	Without DVC	With DVC
Which dataset trained this model?	“Check the shared drive, maybe?”	`git log` shows exact data version
Someone changed the training data	No history, no diff	`dvc diff` shows exactly what changed
Reproduce last month’s model	Impossible	`git checkout` + `dvc checkout`

Your Weekend Starter

Six commands. That’s all you need. (Full DVC docs)

Feature Stores: The Package Registry for ML Features

Tue, 14 Apr 2026 00:00:00 +0000

Your training pipeline computes “average transaction amount” as the mean of the last 30 days. Your inference API computes it as the mean of the last 7 days.

Same feature name. Different values. Your model is silently wrong.

This is training-serving skew. The number one silent killer of ML models in production.

The Problem

ML features get computed in two places:

Context	How Features Are Computed	Problem
Training	Batch job on historical data, saved to CSV	Code written by data scientist
Serving	API computes on the fly per request	Different code, different logic

Two separate implementations. They drift apart over time. Nobody notices until revenue drops.
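
The core idea behind the fix is a single feature definition that both paths call. The sketch below shows only that idea in plain Python, not a feature store API; the function name, the sample transactions, and the 30-day window constant are illustrative.

```python
from datetime import date, timedelta

WINDOW_DAYS = 30  # the single source of truth for the window length


def avg_transaction_amount(transactions: list[tuple[date, float]], as_of: date) -> float:
    """Mean amount over the WINDOW_DAYS days ending on as_of (inclusive)."""
    start = as_of - timedelta(days=WINDOW_DAYS)
    amounts = [amt for day, amt in transactions if start < day <= as_of]
    return sum(amounts) / len(amounts) if amounts else 0.0


history = [(date(2026, 4, 1), 100.0), (date(2026, 4, 20), 300.0), (date(2026, 1, 1), 900.0)]
today = date(2026, 4, 30)

training_value = avg_transaction_amount(history, today)  # batch path
serving_value = avg_transaction_amount(history, today)   # request path
print(training_value, serving_value)
```

With one definition, changing the window means changing one constant, and training and serving cannot silently disagree about it.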

ML Cost Optimization: One YAML Field Cut Our Bill by 80%

Tue, 14 Apr 2026 00:00:00 +0000

We changed one YAML field from 1 to 0. Infrastructure cost dropped 80%.

The field: minReplicas.

When set to 1, your ML inference pod runs 24/7. Even at 3 AM when nobody is making predictions. That’s $50-150 per month per model, running idle.

When set to 0, the pod scales to zero when idle. Traffic arrives, the pod spins up. Traffic stops, the pod disappears. You pay only for what you use.
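
The arithmetic behind a saving of this size is short. This sketch assumes one pod per model, a hypothetical rate of $0.10 per hour, and traffic during 20% of hours; none of these figures come from the post, and cold-start overhead is ignored.

```python
HOURS_PER_MONTH = 730  # average hours in a month


def monthly_cost(hourly_rate: float, active_fraction: float, min_replicas: int) -> float:
    """Monthly cost of one model: always-on if min_replicas >= 1, else pay for active time."""
    if min_replicas >= 1:
        return hourly_rate * HOURS_PER_MONTH
    return hourly_rate * HOURS_PER_MONTH * active_fraction


rate = 0.10             # assumed $/hour for one inference pod
active_fraction = 0.20  # assumed share of hours with traffic

always_on = monthly_cost(rate, active_fraction, min_replicas=1)
scale_to_zero = monthly_cost(rate, active_fraction, min_replicas=0)
savings = 1 - scale_to_zero / always_on
print(f"${always_on:.2f} -> ${scale_to_zero:.2f} ({savings:.0%} saved)")
```

The saving tracks the idle fraction directly: the less often a model is called, the more scale-to-zero is worth.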

ML Governance: The Champion-Challenger Pattern for Model Deployment

Tue, 14 Apr 2026 00:00:00 +0000

Your ML serving code should never know about version numbers. Ever.

If your inference service loads fraud-detector-v47, you have a problem. What happens when v48 is ready? Code change. New deploy. Downtime risk.

Now imagine this: your service always loads the model tagged @champion. (MLflow Model Registry docs) When v48 is promoted, the tag moves. Next request gets the new model. Zero code changes. Zero downtime.
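
The mechanism is one level of indirection: serving code resolves an alias, and promotion moves the alias. The class below is an in-memory stand-in written for illustration, not the MLflow API, and the artifact paths are hypothetical.

```python
class ToyRegistry:
    """In-memory stand-in for a model registry; not the MLflow API."""

    def __init__(self) -> None:
        self._versions: dict[int, str] = {}  # version number -> artifact location
        self._aliases: dict[str, int] = {}   # alias -> version number

    def register(self, version: int, artifact: str) -> None:
        self._versions[version] = artifact

    def set_alias(self, alias: str, version: int) -> None:
        if version not in self._versions:
            raise KeyError(f"unknown version {version}")
        self._aliases[alias] = version

    def resolve(self, alias: str) -> str:
        """What the serving code calls: it never sees a version number."""
        return self._versions[self._aliases[alias]]


registry = ToyRegistry()
registry.register(47, "s3://models/fraud-detector/47")  # hypothetical paths
registry.register(48, "s3://models/fraud-detector/48")

registry.set_alias("champion", 47)
before = registry.resolve("champion")
registry.set_alias("champion", 48)  # promotion: only the alias moves
after = registry.resolve("champion")
print(before, "->", after)
```

Rollback is the same operation in reverse: point the alias back at the previous version.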

ML Model Monitoring: Your Grafana Dashboard Is Lying to You

Tue, 14 Apr 2026 00:00:00 +0000

Your ML model was 95% accurate when you deployed it. That was 6 months ago. Nobody has checked since.

A model can show 10% CPU, zero errors, healthy pod status. And still return garbage predictions. Your Grafana dashboard shows all green. Your customers see wrong results.

Why This Happens

Your monitoring tracks CPU, memory, and pod restarts. Your model cares about none of that.

ML Pipeline Orchestration with Kubeflow on Kubernetes

Tue, 14 Apr 2026 00:00:00 +0000

Your ML team has 47 Jupyter notebooks. 12 of them “should run in order.” Nobody remembers which 12.

One fetches data. Another cleans it. A third trains. A fourth evaluates. A fifth deploys. Different repos. Hardcoded paths. Two only work on Sarah’s laptop.

This is not a pipeline. This is a disaster waiting for a deadline.

Why ML Pipelines Are Different

Data pipelines move data from A to B. ETL. Airflow handles this well.

ML Retraining Pipelines: From Drift Alert to Production Model

Tue, 14 Apr 2026 00:00:00 +0000

Your drift detector triggered an alert. Now what?

Most teams freeze. The runbook says “retrain the model.” Nobody knows how. Monitoring without a retraining pipeline is like alerting without a runbook.

The Retraining Spectrum

Level	Trigger	Best For
Manual	Data scientist retrains in a notebook	Small teams, low-risk models
Scheduled	Cron job retrains every week/month	Predictable drift patterns
Triggered	Drift detector kicks off pipeline automatically	High-value models

Most teams should start with manual. Move to scheduled. Graduate to triggered.

MLflow in 60 Seconds: The Complete ML Model Lifecycle

Tue, 14 Apr 2026 00:00:00 +0000

How does an ML model actually get from training to production?

If you’re a DevOps engineer stepping into MLOps, MLflow is the first tool you need to understand. It handles the entire lifecycle: tracking experiments, versioning models, and serving them in production.

The 5-Step Lifecycle

Here’s the full journey of a model, from code to production.

Step	What Happens	DevOps Analogy
Experiment	Write training code, MLflow creates a “run”	Starting a CI build
Run	Logs parameters, metrics, model files	Build artifacts + test results
Model	Best run registered to Model Registry	Pushing image to Container Registry
Registry	Versions (v1, v2, v3) with aliases (@champion, @candidate)	Image tags (:latest, :staging, :prod)
Serving	API loads `models:/fraud-detector@champion`	K8s Deployment pulling :prod tag

Step 1: Experiment

You write training code and run it. MLflow automatically creates a “run” and starts tracking everything.

Scale-to-Zero for ML Models: Stop Paying for Idle Compute

Tue, 14 Apr 2026 00:00:00 +0000

Your ML model runs 24/7. Inference requests come 2% of the time. You’re paying for 98% idle compute.

This is the most expensive mistake in ML deployment. And the fix takes one YAML field.

How It Works

KServe + Knative handles this natively.

1. Your model is serving requests.
2. Traffic drops. 30 seconds of silence.
3. Knative scales pods to ZERO.
4. New request arrives.
5. Pod spins up in seconds. Request served.

Zero requests = zero pods = zero cost.

SHAP Explainability: Why Your ML Model Flagged That Transaction

Tue, 14 Apr 2026 00:00:00 +0000

Your ML model flagged a customer’s transaction. They call support and ask: “Why?”

If you can’t answer, you might be breaking the law.

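GDPR Article 22 restricts decisions based solely on automated processing, and together with the regulation's transparency provisions it is widely read as giving users a right to an explanation. Financial regulators require it. Healthcare demands it.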

The Explanation

Instead of just HIGH RISK: 0.85, you get:

Feature	SHAP Value	Impact
Amount 5x higher than average	+0.32	Increases risk
International from unusual country	+0.21	Increases risk
Transaction at 3 AM local time	+0.15	Increases risk

Each number is a SHAP value. It tells you how much each feature pushed the prediction. Positive = increases risk. Negative = decreases risk.
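
SHAP values are additive: a base value plus the sum of the per-feature values reproduces the model output. The sketch below works through the table's numbers. It assumes that no other feature contributes and that the explanation is in the same units as the 0.85 score, so the base value shown is derived, not taken from the post.

```python
# SHAP values from the table above; the key names are illustrative.
contributions = {
    "amount_vs_average": 0.32,
    "unusual_country": 0.21,
    "hour_3am": 0.15,
}
prediction = 0.85

# Assumption: no other feature contributes, so the base value is the remainder.
base_value = prediction - sum(contributions.values())

reconstructed = base_value + sum(contributions.values())
print(f"base value: {base_value:.2f}, reconstructed score: {reconstructed:.2f}")

# Rank features by absolute impact for a support-facing explanation.
for name, value in sorted(contributions.items(), key=lambda kv: -abs(kv[1])):
    direction = "increases" if value > 0 else "decreases"
    print(f"{name}: {value:+.2f} ({direction} risk)")
```

The ranked list is what a support agent can read back to the customer: the largest contribution first, each with its direction.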

The Two-Container Pattern: Transformer + Predictor for ML Serving

Tue, 14 Apr 2026 00:00:00 +0000

Your ML model expects clean features. Your API receives raw data. Where does the preprocessing live?

Every team gets this wrong the first time. They stuff everything into one container: data validation, feature engineering, ML inference, output formatting. It works. Until it doesn’t.

The Problem with One Container

Model retrained? Rebuild the whole container. Feature logic changed? Rebuild the whole container. Need to scale inference independently? Everything scales together. Or breaks together.
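
The alternative is to give preprocessing and inference separate owners. The sketch below shows that separation with two plain Python classes. It is not the KServe API, the feature names are hypothetical, and the scoring rule is a placeholder standing in for a trained model.

```python
class Transformer:
    """Owns validation and feature engineering; changes when feature logic changes."""

    def preprocess(self, raw: dict) -> list[float]:
        amount = float(raw["amount"])
        is_international = 1.0 if raw.get("country") != raw.get("home_country") else 0.0
        return [amount, is_international]


class Predictor:
    """Owns the model only; changes when the model is retrained."""

    def predict(self, features: list[float]) -> float:
        amount, is_international = features
        # Placeholder scoring rule standing in for a trained model.
        return min(1.0, 0.0001 * amount + 0.3 * is_international)


def handle_request(raw: dict, transformer: Transformer, predictor: Predictor) -> dict:
    """The request path: raw payload -> features -> score."""
    return {"risk": predictor.predict(transformer.preprocess(raw))}


print(handle_request({"amount": "2500", "country": "FR", "home_country": "US"},
                     Transformer(), Predictor()))
```

Because each class has one reason to change, each can be rebuilt, deployed, and scaled without touching the other.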

Quality Gates for ML: 4 Layers Between Training and Production

Sun, 12 Apr 2026 00:00:00 +0000

40% of our candidate models got rejected at the quality gate. That is not a failure rate. That is a protection rate.

Without quality gates, every model that finishes training goes to production. Good models. Bad models. Models trained on corrupted data. Models that score well on the test set but tank in production.

Quality gates ask one question before every deployment: is this model actually better than what we have?

5 Things I Wish I Knew Before Running EKS in Production

Thu, 26 Feb 2026 00:00:00 +0000

Running Amazon EKS in a tutorial and running it in production are two very different experiences. After deploying a 5-microservice retail store application with real AWS services, here are the five lessons that would have saved me time, money, and plenty of late-night debugging sessions.

1. Cluster Autoscaler Doesn’t Consolidate Nodes

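Cluster Autoscaler only removes nodes that are empty or that it judges underutilized and can drain; it does not actively repack workloads onto fewer nodes. If a node is running a single tiny pod that cannot be moved, it stays, and you keep paying for it.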

Building a Complete Observability Stack for EKS with OpenTelemetry and ADOT

Thu, 26 Feb 2026 00:00:00 +0000

Most Kubernetes observability setups are incomplete. Teams install Prometheus, wire up a few dashboards, and call it done. Then a production incident hits and they’re grepping through logs at 3 AM, trying to find a needle in a haystack.

The problem isn’t the tooling — it’s the approach. You need all three observability pillars working together: Traces, Logs, and Metrics. Here’s how I built a complete stack on EKS using AWS Distro for OpenTelemetry (ADOT).

How to Handle Spot Instance Interruptions on EKS with Zero Downtime

Thu, 26 Feb 2026 00:00:00 +0000

“Spot instances are too risky for production.”

That’s the most common objection I hear from DevOps engineers. And it’s wrong. With the right architecture, you can run production workloads on Spot instances with 70% cost savings and zero downtime during interruptions. Here’s exactly how.

The Fear (and Why It’s Overblown)

The concern is legitimate on the surface: AWS can reclaim a Spot instance with just 2 minutes of notice. Without preparation, your pods get terminated, requests fail, and users see errors.

5 Terraform Mistakes That Cost You Money on AWS

Wed, 25 Feb 2026 00:00:00 +0000

If you’ve been running Terraform on AWS for any length of time, chances are your infrastructure has a few hidden cost leaks. I’ve seen these patterns across hundreds of student projects and enterprise environments. Here are the five most common Terraform mistakes that silently drain your AWS budget — and how to fix each one.

1. Not Setting `instance_type` Defaults Wisely

Many engineers copy-paste t3.large or m5.xlarge from tutorials without right-sizing. In Terraform, you should use variables with sensible defaults:

AWS CloudFormation Simplified | Hands-On with YAML