Cost Cutting. The Approach.

Harjeet Singh
9 min readDec 9, 2023

--

Image source- Generated by me using Canvas Gen AI

You have probably come across this term more often than not lately, used in different Slack channels or thrown around in meetings. Most tech companies across sectors and domains are trying to cut costs and improve operational efficiency. So either you or a colleague of yours is working on this, has worked on it recently, or will face the challenge sooner or later.

So, What is Cost Cutting?

Let’s begin. Software development is a tedious process that involves different tools, cloud services, off-the-shelf software, and more, all of which adds up to a huge cost for such projects and ultimately for the whole company as it builds and maintains these workloads. The major cost-contributing factors generally include:

  • Storage - Companies must store data: application data, internal data, customer data, and other forms of data. Although storage prices have improved over time (AWS S3 charges as little as $0.023/GB per month in the Mumbai region), the sheer volume of data companies produce these days makes storage one of the main culprits behind overall cost. You might be wondering why storage cost is such a big factor. It may or may not be, depending on what the firm does: if you are in the telecom industry, for example, call logs have to be retained for up to 3 years as per government regulations.
  • Compute — Next comes the holy grail: compute. For any tangible output there has to be some processing, and for processing to happen there has to be infrastructure to power it. I am talking about memory and storage for computation, and networks for data transfer. Firms these days use AWS EC2 or similar compute instances from other cloud providers. Although there are different pricing models (on-demand, spot), unchecked compute costs (coupled with network transfer, API request costs, etc.) can quickly bleed you dry. I will talk more later about why compute costs are tricky and how we can reduce them.
  • Off-the-shelf Services/Tools - A car manufacturer doesn’t necessarily need to build its own tyres as well: that is a huge investment, and it needs expertise, history, and a cost per unit that makes sense. So they simply buy tyres from a tyre company. Similar things happen in the tech industry. To build software and services you need more software: tools that support functionality, cloud services that help you interact with those tools, and the list is endless. For example, to present data to stakeholders, companies purchase licenses for BI tools such as Looker or Tableau. These licenses have costs associated with them and generally have usage limits, so the more users you have, the more licenses you need and the more you pay.
  • Enterprise Support - Many times, companies prefer enterprise support from the vendors they buy tools, services, or infrastructure from. To carry on with the car example: after buying a car, a customer might purchase additional services such as on-road assistance, free pick-up and drop for maintenance, multi-year insurance packages, and so on. The manufacturer can’t enforce it; it is the customer’s choice. Similarly, software companies might opt for enterprise support that could include help with any issues faced with the tool, early access to new features or beta releases that might land in the open-source version months later, negotiated prices, etc.

There might be more categories as software development evolves over time. With the constant debates of build vs buy, monolith vs microservices, fail fast vs build slowly, and so on, I have tried to confine the cost angle to these major categories, but there may well be more, now and in the future.

Where Do I even start?

We’ll assume we have been handed such a task. Now, what do we do, and where do we start looking? First things first: get all the data. Generally there are dashboards showing the services and infrastructure that cost is attributed to. If you don’t see one, build one. Without a dashboard you don’t know what the cost was before, you don’t know what it is now, you can’t visualize it, and you have no medium to show your results. You can still do the work without one, but it becomes very difficult to track costs daily, weekly, or monthly and to visualize the difference.
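If you want to pull the raw numbers behind such a dashboard yourself, the AWS Cost Explorer API is one place to start. Here is a minimal sketch using boto3; the date range and the grouping by service are illustrative choices, not a prescription.

```python
import boto3

# Sketch: fetch last month's cost per AWS service via the Cost Explorer API.
# Assumes AWS credentials are already configured; the dates are placeholders.
ce = boto3.client("ce", region_name="us-east-1")

response = ce.get_cost_and_usage(
    TimePeriod={"Start": "2023-11-01", "End": "2023-12-01"},
    Granularity="MONTHLY",
    Metrics=["UnblendedCost"],
    GroupBy=[{"Type": "DIMENSION", "Key": "SERVICE"}],
)

for group in response["ResultsByTime"][0]["Groups"]:
    service = group["Keys"][0]
    amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
    print(f"{service}: ${amount:,.2f}")
```

Feeding something like this into Grafana or a simple sheet already gives you the before/after view you need.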

Once your dashboard is sorted, you’ll be able to visualize the graphs you created, broken down into different categories. Let’s pick storage.

Image source — google images

Let’s say you visualize your entire S3 storage and see some cost. That is step one. The next step is digging into how much data you actually have, which is a long and tedious process. Tools like S3 Storage Lens or CloudWatch’s S3 metrics can give you storage per bucket. Then comes the tricky part: going deeper into which services are populating this data.
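As a rough sketch, per-bucket size can also be pulled programmatically from the daily S3 storage metrics that CloudWatch publishes; the bucket name and region below are placeholders.

```python
import boto3
from datetime import datetime, timedelta

# Sketch: read the daily BucketSizeBytes metric CloudWatch publishes for S3.
# Bucket name is a placeholder; only the STANDARD storage class is queried here.
cloudwatch = boto3.client("cloudwatch", region_name="ap-south-1")

stats = cloudwatch.get_metric_statistics(
    Namespace="AWS/S3",
    MetricName="BucketSizeBytes",
    Dimensions=[
        {"Name": "BucketName", "Value": "my-analytics-bucket"},
        {"Name": "StorageType", "Value": "StandardStorage"},
    ],
    StartTime=datetime.utcnow() - timedelta(days=2),
    EndTime=datetime.utcnow(),
    Period=86400,          # the metric is emitted once per day
    Statistics=["Average"],
)

for point in sorted(stats["Datapoints"], key=lambda p: p["Timestamp"]):
    print(point["Timestamp"], f'{point["Average"] / 1024**3:.1f} GiB')
```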

Q. Do you need to retain all this data?
->No.
What can you delete? Are there logs being stored? Can you put a lifecycle retention policy here? Which buckets are stage and dev buckets where some sort of cleanup can happen? More often than not you will find that stage and dev buckets are good candidates, and all you need are tags, added to the objects’ metadata by the services producing the data, so that the same tags can be used in a lifecycle policy. People tend to overlook this powerful feature of S3; used wisely, it can save a lot of cost.
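As a rough sketch of what that looks like with boto3, the rule below expires objects tagged env=dev after 30 days; the bucket name, tag key/value, and retention window are assumptions for illustration.

```python
import boto3

# Sketch: expire dev-tagged objects after 30 days via an S3 lifecycle rule.
# Bucket name, tag key/value, and the 30-day window are illustrative choices.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-stage-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "expire-dev-objects",
                "Filter": {"Tag": {"Key": "env", "Value": "dev"}},
                "Status": "Enabled",
                "Expiration": {"Days": 30},
            }
        ]
    },
)
```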

->Yes.
There are times when you cannot delete any data. But do you need all the data all the time? You have to go back and look at the access patterns. You can do that by simply observing which jobs read this data (by checking the date-range or DateTime filters they use), or by looking at the S3 API requests on the prefixes of different buckets. If some data is accessed infrequently or rarely, you can move it to the Infrequent Access or Glacier storage tiers. To put numbers on it: in the Asia Pacific (Mumbai) region, Standard storage costs $0.023/GB/month and Glacier Deep Archive costs $0.002/GB/month. If your data size is 5 PB and, say, 50% of it is accessed only once or twice a year, moving that half to Glacier Deep Archive saves roughly 2.5 million GB x ($0.023 - $0.002) per GB, i.e. more than $50,000 per month.
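A minimal sketch of such a tiering rule with boto3 is below; the bucket, prefix, and day thresholds are assumptions, and in practice you would pick them from your actual access patterns.

```python
import boto3

# Sketch: transition cold objects under an archival prefix to cheaper tiers.
# Bucket name, prefix, and day thresholds are illustrative, not recommendations.
s3 = boto3.client("s3")

s3.put_bucket_lifecycle_configuration(
    Bucket="my-analytics-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-down-cold-data",
                "Filter": {"Prefix": "call-logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 90, "StorageClass": "STANDARD_IA"},
                    {"Days": 180, "StorageClass": "DEEP_ARCHIVE"},
                ],
            }
        ]
    },
)
```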

Now one can argue that I have chosen extreme ends, and absolutely, yes. But which storage tier works for you will depend solely on your workloads, access patterns, requirements, and your far-sightedness about the use cases to come.

S3 costs also include an easily overlooked component: access cost. Every upload, download, or listing of the objects you store in S3 has a cost associated with it.
For instance, say you are analyzing the workloads and you see an Airflow DAG, run by an analyst, that just queries a Trino table for newer data. This DAG runs every 15 minutes. From a high level, things look normal. But what exactly is a Trino table? Ultimately its data sits in S3, HDFS, Azure storage, or some other store, combined with metadata that provides the table definition. If S3 is the data store, then every time the DAG runs it queries data from S3 and, depending on how much is served from Trino’s cache, most runs will result in ListBucket, HeadObject, and GetObject calls.
Does the DAG need to check for data every 15 minutes? Can we stretch it to one hour? If there are hundreds of DAGs doing the same thing, can you find patterns between them that open a window for optimization? A small tweak from 15 minutes to one hour (if the use case permits), or something more custom such as running the DAG only when new data arrives for a set of tables, will end up saving more cost than you might imagine. Maybe the data producer can send a notification to trigger the DAGs.
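One way to express that "run only when new data arrives" idea is Airflow's dataset-based scheduling (available from Airflow 2.4). The sketch below is illustrative: the dataset URI, DAG id, and task are made up, and the producer DAG would need to declare the same dataset as an outlet.

```python
from datetime import datetime

from airflow import DAG
from airflow.datasets import Dataset
from airflow.operators.empty import EmptyOperator

# Hypothetical dataset URI; the producer DAG must list it in a task's `outlets`.
new_orders_data = Dataset("s3://my-analytics-bucket/orders/")

# Instead of polling Trino every 15 minutes on a cron schedule,
# this DAG runs only when the producer signals that the dataset was updated.
with DAG(
    dag_id="orders_refresh",
    start_date=datetime(2023, 1, 1),
    schedule=[new_orders_data],   # event-driven, replaces "*/15 * * * *"
    catchup=False,
) as dag:
    refresh = EmptyOperator(task_id="refresh_downstream_tables")
```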

What can you do with data formats? Is the data stored in the best format available? If consumers read it via an abstraction (Trino, Tableau, Looker), can you store the data in a more compact, columnar format such as ORC or Parquet?
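As a quick illustration (the paths and the snappy compression choice are assumptions, and reading s3:// URLs with pandas needs the s3fs package), converting a raw CSV dump into Parquet could look like this:

```python
import pandas as pd

# Sketch: rewrite a raw CSV dump as compressed, columnar Parquet.
# Paths are placeholders; reading/writing s3:// URLs requires the s3fs package.
df = pd.read_csv("s3://my-analytics-bucket/raw/events/2023-12-01.csv")

df.to_parquet(
    "s3://my-analytics-bucket/curated/events/2023-12-01.parquet",
    compression="snappy",  # columnar layout + compression: engines scan far fewer bytes
    index=False,
)
```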

So it is all about understanding where the cost comes from, seeing what is causing it, and finding the patterns that will form the base of your optimizations.

What Next?

Image source — google

So far so good. We found test data lying around in S3 buckets and logs that had kept piling up for months (we applied a 30-day or 60-day lifecycle policy), cleaned up buckets, and so on. We also observed that we could do with less frequent refreshes of Tableau dashboards.

How do we build hygiene around cost savings? What if the costs shoot up again? We can’t keep doing this indefinitely. Anyone who has worked on cost savings, analyzing where the money goes and finding patterns to cut it, knows it takes an immense amount of time to get right, takes multiple iterations, and sometimes all the effort goes to waste, courtesy of spot loss on EC2 instances (too much pain there, sorry).

Carrying on with S3 as an example (similar mechanisms exist for other cloud services), we need to set up alerts for cost increases. What kind of alerts?
Again, it will largely depend on your use case, your jobs, and the key phrase: access patterns. If a service consumes data, processes it, and stores it, and you know the average data volume per month is around 500 GB, then an alert saying last month’s data is way beyond 500 GB means you need to act: check whether it is just a normal traffic increase or whether something has gone haywire.

Similar alerts can be set up on PUT, GET, and LIST requests based on your jobs’ traffic patterns. The more (and better) alerts you have, the earlier you can act on them and keep costs from creeping up. Are all your stage and dev workloads properly tagged and written to specific buckets? Do all those buckets have a lifecycle policy set? What about log hygiene? You see where I am headed, right?
AWS supports setting up such alerts, and the documentation is easy to follow. You can and should also set up alerts on the dashboards you created above (Looker, Tableau, Grafana, etc.).
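As a rough sketch of one such alert (the threshold, alarm name, and SNS topic ARN are assumptions; AWS billing metrics require "Receive Billing Alerts" to be enabled and only exist in us-east-1):

```python
import boto3

# Sketch: alarm when the month-to-date estimated AWS bill crosses a threshold.
# Requires billing alerts to be enabled; billing metrics live only in us-east-1.
cloudwatch = boto3.client("cloudwatch", region_name="us-east-1")

cloudwatch.put_metric_alarm(
    AlarmName="monthly-bill-over-10k",          # illustrative name and threshold
    Namespace="AWS/Billing",
    MetricName="EstimatedCharges",
    Dimensions=[{"Name": "Currency", "Value": "USD"}],
    Statistic="Maximum",
    Period=21600,                               # the metric updates a few times a day
    EvaluationPeriods=1,
    Threshold=10000.0,
    ComparisonOperator="GreaterThanThreshold",
    AlarmActions=["arn:aws:sns:us-east-1:123456789012:cost-alerts"],  # placeholder SNS topic
)
```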

Conclusion

Cost cutting is tricky: even when there are clear candidates, you might find it hard to rate-limit, delete data, or rein in the resources provisioned for a job. While I discuss storage in detail in this article, similar things can be done for compute and the other categories mentioned above.

To touch on compute briefly: you have the classic Spot Instances, which cost a fraction of the on-demand price of the EC2 instances you would normally use (historically allocated through a bidding process, now at a market price with the risk of interruption). People take many approaches, from scaling down at night, to index creation on databases, to Z-ordering Delta files, and plenty of other things to improve runtime. With EC2 you always want to optimize around time: either you run your batch workload when there is less contention for Spot capacity (early morning?), pick an instance type with a lower interruption rate, or explore cost-saving instance families (Graviton?); or, on the other hand, you find yourself optimizing time differently. Can you combine these compute loads into one batch? Are pods sitting idle? And the list goes on and on.
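For illustration only (the AMI id, instance type, and region are placeholders), requesting Spot capacity with boto3 can be as simple as adding market options to a normal launch call:

```python
import boto3

# Sketch: launch a one-time Spot instance instead of on-demand.
# AMI id, instance type, and region are placeholders for illustration.
ec2 = boto3.client("ec2", region_name="ap-south-1")

ec2.run_instances(
    ImageId="ami-0123456789abcdef0",
    InstanceType="m6g.large",        # e.g. a Graviton-based instance family
    MinCount=1,
    MaxCount=1,
    InstanceMarketOptions={
        "MarketType": "spot",
        "SpotOptions": {
            "SpotInstanceType": "one-time",
            "InstanceInterruptionBehavior": "terminate",
        },
    },
)
```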

Similar arguments happen for build vs buy with off-the-shelf tools: most companies prefer to use open-source versions of software and services and buy enterprise versions only when it is absolutely necessary. These are trickier decisions and come with experience.

To summarise:

  • Clean up the old mess — Most of the time, things were not done in an ideal way, so one of the first steps in all the categories above is to clean up the existing mess. It is the most time-consuming and boring thing to do, but it is the core step in cutting costs. It also helps operational efficiency, but that is just a happy by-product.
  • Access patterns — I can’t stress this enough. Dig deep and find places for optimization by analyzing storage, compute, workloads, and the proprietary software: how they are being used, why they were chosen in the first place, and why a particular workload works the way it does. (Does a job always have to poll? Why can’t the producer push? That is a huge difference, depending on volume and frequency.)

I have seen folks write Trino queries without a sensible date-range filter, and DAGs that only need the last day’s data for an incremental refresh yet run code that does a full refresh (see the sketch after this list). I have seen stakeholders who don’t know what "real time" really means consume real-time updates from Kafka when a batch job running every hour would do the same work for them.

  • Hygiene — Once the two activities above are done, set up alerts wherever you can to follow up on your cost targets. The more time you spend on those activities, the better the hygiene process you can come up with and apply to your system.
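To illustrate the full-refresh vs incremental point above, here is a minimal sketch using the trino Python client; the host, catalog, schema, table, and column names are all made up.

```python
import trino

# Sketch: an incremental read that only scans yesterday's partition,
# instead of a full-table refresh. Connection details and names are placeholders.
conn = trino.dbapi.connect(
    host="trino.internal.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="analytics",
)
cur = conn.cursor()

# Pushing the date filter down lets Trino prune partitions,
# which means far fewer S3 List/Get calls and far less data scanned.
cur.execute(
    """
    SELECT *
    FROM orders
    WHERE order_date = current_date - INTERVAL '1' DAY
    """
)
rows = cur.fetchall()
```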

Cheers.


Harjeet Singh

Problem Solver, writes on Tech, finance and Product. Watch out for my new creation, "THE PM SERIES"