How We Slashed Our EKS Bill by 40%: A Real Story

Yash Thaker
AWS in Plain English
5 min read · Jan 20, 2025

Last year, we inherited a messy EKS infrastructure that was burning through cash like crazy. Around $25K a month, to be exact. Our CTO wasn’t happy, and we had to figure out how to cut costs without breaking things. After months of trial and error, we managed to save over $100K annually. Here’s what actually worked for us.

The Wake-Up Call

I still remember the day our CFO walked into our team meeting with a concerned look. Our AWS bills had been climbing steadily, and EKS was the biggest culprit. The funny thing? We had no idea where all that money was going.

The first thing we did was break down our costs:
- EC2 instances for worker nodes (this was bleeding us dry)
- Storage (those EBS volumes add up fast)
- Data transfer (cross-AZ traffic was killing us)
- Load balancers (we had way too many)

Finding the Low-Hanging Fruit

Our first win was embarrassingly simple. We had been running our dev and staging environments 24/7. Why? “Because we always did it that way.” A quick script to shut down non-prod clusters during off-hours instantly saved us $3K monthly.

Here’s what we used:

# eks-scheduler.yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: cluster-scaler
spec:
  schedule: "0 20 * * 1-5"  # 8 PM weekdays (evaluated in UTC unless spec.timeZone is set)
  jobTemplate:
    spec:
      template:
        spec:
          serviceAccountName: cluster-scaler  # needs RBAC to scale deployments (sketch below)
          restartPolicy: OnFailure            # required for Job pods
          containers:
          - name: cluster-ops
            image: bitnami/kubectl
            command:
            - /bin/sh
            - -c
            # scales every deployment in the job's namespace down to zero
            - kubectl scale deployment --all --replicas=0
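
One gotcha: the job's pod can't scale anything with the default service account, so it needs a bit of RBAC. A minimal sketch, assuming the CronJob lives in the same non-prod namespace as the deployments it scales (the names and the dev namespace are placeholders):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-scaler
  namespace: dev            # placeholder namespace
---
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  name: deployment-scaler
  namespace: dev
rules:
- apiGroups: ["apps"]
  resources: ["deployments", "deployments/scale"]
  verbs: ["get", "list", "patch", "update"]
---
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: deployment-scaler
  namespace: dev
subjects:
- kind: ServiceAccount
  name: cluster-scaler
  namespace: dev
roleRef:
  kind: Role
  name: deployment-scaler
  apiGroup: rbac.authorization.k8s.io

You'll also want a matching morning schedule to bring everything back up before the team logs in.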

The Worker Node Saga

Next up was our worker node setup. We were running m5.2xlarge instances across the board because someone read it was “good for general use.” Classic.

After actually looking at our usage patterns (thank you, Prometheus), we realized something interesting: our Java services were memory hogs, but CPU usage was minimal. Meanwhile, our Go services were CPU-intensive but light on memory.

We split our workloads:

# Before: One size fits none
nodeGroups:
- name: workers
  instanceType: m5.2xlarge
  desiredCapacity: 30

# After: Mix and match
nodeGroups:
- name: java-services
  instanceTypes: ["r5.xlarge", "r5a.xlarge"]
  desiredCapacity: 20
- name: go-services
  instanceTypes: ["c5.large", "c5a.large"]
  desiredCapacity: 10

This simple change cut our EC2 costs by 35%. The team thought I was a genius. I didn’t tell them it was just common sense.
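
If you're wondering how pods actually end up on the right nodes: a label on each node group plus a nodeSelector on each deployment does it. A rough sketch, not our exact manifests (the workload-type label, names, and image are placeholders):

# eksctl: label the node group
nodeGroups:
- name: java-services
  instanceTypes: ["r5.xlarge", "r5a.xlarge"]
  desiredCapacity: 20
  labels:
    workload-type: java

# Kubernetes: pin the service to those nodes
apiVersion: apps/v1
kind: Deployment
metadata:
  name: orders-api              # placeholder service
spec:
  replicas: 3
  selector:
    matchLabels:
      app: orders-api
  template:
    metadata:
      labels:
        app: orders-api
    spec:
      nodeSelector:
        workload-type: java
      containers:
      - name: orders-api
        image: registry.example.com/orders-api:1.0   # placeholder image
        resources:
          requests:
            memory: "4Gi"       # memory-heavy, CPU-light, hence the r5 nodes
            cpu: "500m"

Honest resource requests matter just as much as the instance split: they're what let the scheduler and cluster autoscaler actually pack the r5 and c5 nodes efficiently.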

The Spot Instance Adventure

Everyone talks about using spot instances, but let me tell you — our first attempt was a disaster. We tried to run everything on spot, and our services went down during peak hours when spot prices spiked.

Here’s what actually worked:

1. Use spot instances for stateless workloads only
2. Set up a mixed instance policy
3. Implement proper pod disruption budgets

Here’s our battle-tested setup:

apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig
managedNodeGroups:   # managed node groups take spot: true directly
- name: spot-workers
  instanceTypes: ["m5.xlarge", "m5a.xlarge", "m5n.xlarge"]
  desiredCapacity: 5
  minSize: 3
  maxSize: 15
  spot: true
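
Point 3 above, pod disruption budgets, is what stops a spot reclaim from taking out every replica of a service at once. Roughly what ours look like for a stateless service (name and label are placeholders):

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: api-pdb              # placeholder name
spec:
  minAvailable: 2            # keep at least 2 pods running through node drains
  selector:
    matchLabels:
      app: api               # placeholder label; match your deployment's pods

EKS managed node groups already try to drain spot nodes gracefully when an interruption notice arrives; the PDB is what keeps that drain from dropping a service below its floor.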

Pro tip: Always keep some on-demand instances for critical workloads. Trust me, your sleep schedule will thank you.

Storage: The Silent Budget Killer

Nobody paid attention to our storage costs until we noticed we were spending $3K monthly on mostly empty volumes. The culprit? Every developer was requesting 100GB volumes because “storage is cheap.”

We implemented storage quotas:

# ResourceQuota is namespaced, so apply one per team namespace
apiVersion: v1
kind: ResourceQuota
metadata:
  name: storage-quota
spec:
  hard:
    requests.storage: 500Gi
    persistentvolumeclaims: "10"

And switched to gp3 volumes:

apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: gp3-standard
provisioner: ebs.csi.aws.com
parameters:
  type: gp3
  iops: "3000"   # gp3 baseline IOPS, included in the base price
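
With the quota and the new class in place, a developer's claim ends up small and explicit instead of a reflexive 100GB. An illustrative PVC (name and size are examples):

apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data              # example name
spec:
  accessModes:
  - ReadWriteOnce
  storageClassName: gp3-standard
  resources:
    requests:
      storage: 20Gi           # request what the service needs; it counts against the namespace quota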

Real Talk About Results

After six months of optimization:
- Monthly AWS bill: Down from $25K to $15K
- Application performance: Actually improved (funny how that works)
- Team happiness: Way up (fewer 3 AM calls)

Lessons Learned the Hard Way

1. Start with monitoring. You can’t optimize what you can’t measure.
2. Don’t try to optimize everything at once. We broke our cluster three times by being too aggressive.
3. Get your developers involved. They know their applications better than anyone.

What’s Next?

We’re looking at Graviton2 instances now. Initial tests show another 20% potential savings. I’ll probably write another post about that adventure once we’ve finished testing.
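
A rough sketch of the kind of node group we're trialing (instance types are illustrative arm64 options, not a final choice):

managedNodeGroups:
- name: graviton-workers
  instanceTypes: ["m6g.xlarge", "m6g.2xlarge"]   # Graviton2, arm64
  desiredCapacity: 3
  minSize: 1
  maxSize: 6

The main prerequisite is that container images are built for arm64; multi-arch builds make that a one-time change per service.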

That’s our story. Not glamorous, but it worked. What’s your experience with EKS costs? Any horror stories or wins to share? Drop a comment below — I’d love to hear them.

Found this helpful? Let’s connect!

🔗 Follow me on LinkedIn for more tech insights and best practices.

💡 Have thoughts or questions? Drop them in the comments below — I’d love to hear your perspective.

If this article added value to your day, consider giving it a 👏 to help others discover it too.

Until next time!
