How We Slashed Our EKS Bill by 40%: A Real Story

Yash Thaker
AWS in Plain English
5 min read · Jan 20, 2025

Last year, we inherited a messy EKS infrastructure that was burning through cash like crazy. Around $25K a month, to be exact. Our CTO wasn’t happy, and we had to figure out how to cut costs without breaking things. After months of trial and error, we managed to save over $100K annually. Here’s what actually worked for us.

The Wake-Up Call

I still remember the day our CFO walked into our team meeting with a concerned look. Our AWS bills had been climbing steadily, and EKS was the biggest culprit. The funny thing? We had no idea where all that money was going.

The first thing we did was break down our costs:
- EC2 instances for worker nodes (this was bleeding us dry)
- Storage (those EBS volumes add up fast)
- Data transfer (cross-AZ traffic was killing us)
- Load balancers (we had way too many)

Finding the Low-Hanging Fruit

Our first win was embarrassingly simple. We had been running our dev and staging environments 24/7. Why? “Because we always did it that way.” A quick script to shut down non-prod clusters during off-hours instantly saved us $3K monthly.

Here’s what we used:

# eks-scheduler.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cluster-scaler
spec:
  schedule: "0 20 * * 1-5"  # 8 PM weekdays (evaluated in UTC unless spec.timeZone is set)
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cluster-scaler  # needs RBAC to scale deployments (sketch below)
          restartPolicy: OnFailure            # required for Job pods
          containers:
          - name: cluster-ops
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            # scales every deployment in the job's namespace down to zero
            - kubectl scale deployment --all --replicas=0
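
One gotcha: the job's pod can't scale anything with the default service account, so it needs a bit of RBAC. A minimal sketch, assuming the CronJob lives in the same non-prod namespace as the deployments it scales (the names and the dev namespace are placeholders):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-scaler
  namespace: dev            # placeholder namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-scaler
  namespace: dev
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "deployments/scale"]
  verbs: ["get", "list", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployment-scaler
  namespace: dev
subjects:
- kind: ServiceAccount
  name: cluster-scaler
  namespace: dev
roleRef:
  kind: Role
  name: deployment-scaler
  apiGroup: rbac.authorization.k8s.io

You'll also want a matching morning schedule to bring everything back up before the team logs in.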

The Worker Node Saga

Next up was our worker node setup. We were running m5.2xlarge instances across the board because someone read it was “good for general use.” Classic.

After actually looking at our usage patterns (thank you, Prometheus), we realized something interesting: our Java services were memory hogs, but CPU usage was minimal. Meanwhile, our Go services were CPU-intensive but light on memory.

We split our workloads:

# Before: One size fits none
nodeGroups:
- name: workers
  instanceType: m5.2xlarge
  desiredCapacity: 30

# After: Mix and match
nodeGroups:
- name: java-services
  instanceTypes: ["r5.xlarge", "r5a.xlarge"]
  desiredCapacity: 20
- name: go-services
  instanceTypes: ["c5.large", "c5a.large"]
  desiredCapacity: 10

This simple change cut our EC2 costs by 35%. The team thought I was a genius. I didn’t tell them it was just common sense.
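
If you're wondering how pods actually end up on the right nodes: a label on each node group plus a nodeSelector on each deployment does it. A rough sketch, not our exact manifests (the workload-type label, names, and image are placeholders):

# eksctl: label the node group
nodeGroups:
- name: java-services
  instanceTypes: ["r5.xlarge", "r5a.xlarge"]
  desiredCapacity: 20
  labels:
    workload-type: java

# Kubernetes: pin the service to those nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api              # placeholder service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      nodeSelector:
        workload-type: java
      containers:
      - name: orders-api
        image: registry.example.com/orders-api:1.0   # placeholder image
        resources:
          requests:
            memory: "4Gi"       # memory-heavy, CPU-light, hence the r5 nodes
            cpu: "500m"

Honest resource requests matter just as much as the instance split: they're what let the scheduler and cluster autoscaler actually pack the r5 and c5 nodes efficiently.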

The Spot Instance Adventure

Everyone talks about using spot instances, but let me tell you — our first attempt was a disaster. We tried to run everything on spot, and our services went down during peak hours when spot prices spiked.

Here’s what actually worked:

1. Use spot instances for stateless workloads only
2. Set up a mixed instance policy
3. Implement proper pod disruption budgets

Here’s our battle-tested setup:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
managedNodeGroups:   # managed node groups take spot: true directly
- name: spot-workers
  instanceTypes: ["m5.xlarge", "m5a.xlarge", "m5n.xlarge"]
  desiredCapacity: 5
  minSize: 3
  maxSize: 15
  spot: true
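
Point 3 above, pod disruption budgets, is what stops a spot reclaim from taking out every replica of a service at once. Roughly what ours look like for a stateless service (name and label are placeholders):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb              # placeholder name
spec:
  minAvailable: 2            # keep at least 2 pods running through node drains
  selector:
    matchLabels:
      app: api               # placeholder label; match your deployment's pods

EKS managed node groups already try to drain spot nodes gracefully when an interruption notice arrives; the PDB is what keeps that drain from dropping a service below its floor.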

Pro tip: Always keep some on-demand instances for critical workloads. Trust me, your sleep schedule will thank you.

Storage: The Silent Budget Killer

Nobody paid attention to our storage costs until we noticed we were spending $3K monthly on mostly empty volumes. The culprit? Every developer was requesting 100GB volumes because “storage is cheap.”

We implemented storage quotas:

# ResourceQuota is namespaced, so apply one per team namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota
spec:
  hard:
    requests.storage: 500Gi
    persistentvolumeclaims: "10"

And switched to gp3 volumes:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-standard
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"   # gp3 baseline IOPS, included in the base price
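
With the quota and the new class in place, a developer's claim ends up small and explicit instead of a reflexive 100GB. An illustrative PVC (name and size are examples):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data              # example name
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp3-standard
  resources:
    requests:
      storage: 20Gi           # request what the service needs; it counts against the namespace quota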

Real Talk About Results

After six months of optimization:
- Monthly AWS bill: Down from $25K to $15K
- Application performance: Actually improved (funny how that works)
- Team happiness: Way up (fewer 3 AM calls)

Lessons Learned the Hard Way

1. Start with monitoring. You can’t optimize what you can’t measure.
2. Don’t try to optimize everything at once. We broke our cluster three times by being too aggressive.
3. Get your developers involved. They know their applications better than anyone.

What’s Next?

We’re looking at Graviton2 instances now. Initial tests show another 20% potential savings. I’ll probably write another post about that adventure once we’ve finished testing.
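
A rough sketch of the kind of node group we're trialing (instance types are illustrative arm64 options, not a final choice):

managedNodeGroups:
- name: graviton-workers
  instanceTypes: ["m6g.xlarge", "m6g.2xlarge"]   # Graviton2, arm64
  desiredCapacity: 3
  minSize: 1
  maxSize: 6

The main prerequisite is that container images are built for arm64; multi-arch builds make that a one-time change per service.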

That’s our story. Not glamorous, but it worked. What’s your experience with EKS costs? Any horror stories or wins to share? Drop a comment below — I’d love to hear them.

Found this helpful? Let’s connect!

🔗 Follow me on LinkedIn for more tech insights and best practices.

💡 Have thoughts or questions? Drop them in the comments below — I’d love to hear your perspective.

If this article added value to your day, consider giving it a 👏 to help others discover it too.

Until next time!
