Decoding the DOMAIN in Data

Harjeet Singh
7 min read · Mar 31, 2024


Software development is a tricky process. The choice of infrastructure, the technologies used, and the process followed are based on sound principles of software design, but the domain you are in and the particular business problem you are solving influence those choices the most. Yet data engineering is one segment where folks feel the domain is not as relevant. But is it?

Image taken from the internet.

The image above says it all: you need to understand the problem before picking the tool.

When people think of data engineering, they immediately think of Airflow, some data pipelines, and SQL. If I had a penny for every time someone equated data engineering with SQL, I would be a very rich man today. DE is not SQL; it is much more than that! Okay, I got carried away. Coming back to our discussion: DE is all about the journey of data, and more. But people seem to concentrate on the “data” part more than the journey.

  • How did the data arrive in the lake?
  • How is the data stored in the lake? What partition columns are used? (See the sketch after this list.)
  • Who are the consumers of this data?
  • How will they access it? (Query, Pull<>Push, Pub<>Sub, APIs; the list is endless.)
  • Who can and cannot access the data? (RBAC)
  • What is in the data? (Discovery)
  • What file format is used?
  • Which cloud infrastructure? AWS? GCP?
  • and hundreds more questions along similar lines.
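
To make a couple of these concrete (the partition-column and file-format questions), here is a minimal PySpark sketch; the path, column names, and layout are hypothetical:

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("partition-pruning-demo").getOrCreate()

# A Parquet dataset laid out as .../event_date=YYYY-MM-DD/.
# Filtering on the partition column lets the engine skip whole
# directories instead of scanning the entire lake.
events = (
    spark.read.parquet("s3://my-data-lake/events/")  # hypothetical path
         .filter("event_date = '2024-03-01'")        # partition column
)
events.show()
```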

While these questions are relevant, this discussion is more tailored toward how the domain influences decision-making at one point or another. The domain can be ignored to some extent, and it need not be thought of at every step either; there has to be a middle ground, just like Thanos said: “Perfectly balanced, as all things should be.”

The domain, or the business you are in, does affect what you do and why, and especially “how” you do it. Software engineers often forget that they are ultimately part of a company, that the company has a business, and that for the business to survive it has to make money and play by the rules laid down for its field. For example, every finance company in the world is governed by rules from that country's financial regulators.

Similarly, if you are a telecom operator in India, TRAI (Telecom Regulatory Authority of India) mandates that the company store at least 2–3 years of internet usage and call data records. This simply means you can't purge the data even if you want to. You have to think of data retention as an inherent feature of everything you design. And the more products you build on top of your core business, the more everything starts falling under some sort of regulation, which varies from country to country. I know data engineers are straight away thinking of dumping this data into Iceberg with a two-year lifecycle policy (LC on AWS), but it's not that straightforward: the government can request any amount of data at any time.
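
To make this concrete, here is a minimal sketch of a retention-aware lifecycle policy using boto3: data is tiered to cheaper storage classes but never expired, so it stays retrievable if a regulator asks. The bucket name and prefix are hypothetical.

```python
import boto3

s3 = boto3.client("s3")
s3.put_bucket_lifecycle_configuration(
    Bucket="telecom-cdr-lake",  # hypothetical bucket
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "tier-cdrs-never-expire",
                "Filter": {"Prefix": "cdr/"},
                "Status": "Enabled",
                # Move data to cheaper tiers as it ages...
                "Transitions": [
                    {"Days": 90, "StorageClass": "GLACIER"},
                    {"Days": 365, "StorageClass": "DEEP_ARCHIVE"},
                ],
                # ...but note there is no Expiration action: the records
                # must outlive the mandated retention window.
            }
        ]
    },
)
```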

I just gave a small example above of how the domain affects decision-making. Now let's say you are building a system to query data in near real time. The choice of database, the acceptable response time, and the amount of data you scan vary greatly if you are a telecom company versus, say, a bank. With a bank, if you request a recent statement, it serves the last six months' data from its cache or hot storage; when you request a detailed historical statement, you get a real-time, instantaneous “message” (just kidding) saying the actual data will be mailed to you soon (in some abstract time). Banks can run this process easily from cold storage, since they are not bound to serve historical data in real time, so the architecture can be a cost-effective, optimized process with relaxed latency and precision requirements.
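
A minimal sketch of that routing decision, with hypothetical helpers standing in for the real hot and cold stores:

```python
from datetime import date, timedelta

HOT_WINDOW_DAYS = 180  # assumption: the last ~6 months live in hot storage

def query_hot_store(account_id: str, start: date, end: date) -> dict:
    # Placeholder: hit the cache / OLTP replica synchronously.
    return {"status": "ok", "rows": []}

def enqueue_cold_storage_job(account_id: str, start: date, end: date) -> str:
    # Placeholder: submit a batch job (Athena/Spark) over archived data.
    return "job-123"

def fetch_statement(account_id: str, start: date, end: date) -> dict:
    """Route a statement request to hot or cold storage by date range."""
    hot_cutoff = date.today() - timedelta(days=HOT_WINDOW_DAYS)
    if start >= hot_cutoff:
        # Recent data: serve synchronously, the user expects it instantly.
        return query_hot_store(account_id, start, end)
    # Historical data: no real-time SLA, so run it cheaply and asynchronously.
    job_id = enqueue_cold_storage_job(account_id, start, end)
    return {"status": "queued", "job_id": job_id,
            "message": "Statement will be emailed when ready."}
```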

Coming back to the crux of our conversation: how do data engineers handle the domain? And is the domain relevant in every case? Well, in most cases, yes. The keyword being most.
The entire data vertical of a business will still work even if we don't care much about the domain: data engineering, platformization of the process (the data platform), products on top of it (data products), and experiments and analytics (data science). But it will be like using a train to transport one kilogram of goods. Highly unoptimized.

Suppose you have to build a pipeline that takes data from different sources, does some processing (adding metadata, company-specific columns, partitioning logic), and then stores it in the data lake. The mind is already running: What is the frequency of the data? Will the pipeline run daily or hourly? What if the pipeline breaks? How do we make the process self-serve? What about backfill? And so on, on top of the questions I wrote at the very beginning. A sketch of that enrich-and-land step follows below.
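
Here is a minimal PySpark sketch of such a step; the source and target paths, the `event_time` column, and the metadata columns are assumptions, not a real schema:

```python
from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("ingest-to-lake").getOrCreate()

raw = spark.read.json("s3://raw-zone/source-a/2024-03-31/")  # hypothetical source

enriched = (
    raw.withColumn("ingested_at", F.current_timestamp())  # pipeline metadata
       .withColumn("source_system", F.lit("source-a"))    # lineage column
       .withColumn("dt", F.to_date("event_time"))         # partitioning logic
)

(enriched.write
         .mode("append")
         .partitionBy("dt", "source_system")
         .parquet("s3://lake/curated/events/"))  # hypothetical target
```
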
Time to deep-dive into the domain as well, and this has to be done most of the time. At a bare minimum, the engineers working on a company's product should know the product their team is building inside and out, what the organization is trying to solve, and what the company's main business is.
If the data is a telecom operator's, will it change after landing? That is, will it receive updates? Most likely not. If a call has happened, it's done: A called B at this time, from this latitude and longitude, for X minutes, and the call was routed through these operators and these cells were involved (showing off telecom domain ;) ), and so on. This data is highly unlikely to change. It's in the past; there can't be a later record saying B did not talk to A.
Now this makes storage, processing, and everything downstream easy. Anyone who queries this data from the data lake is not worried that it might be stale. In the financial world, on the other hand, if A (a customer) pays B (Zomato) and later A's food does not arrive and a refund is initiated, then until the refund is processed by the source system, the latest state of the data is not in the lake. The data is mutable, and the data engineer has to build that into the system inherently: from collecting the data from the source, to the post-processing logic, the choice of storage on the lake, the access patterns from the lake, and how the data ends up on visual dashboards. Either the data team handles the update and shows one final record, or, if they are just appending data, every downstream process has to handle multiple records and coalesce them to get one final record with the latest update (which is a bad design pattern). Now the lake has to support upserts.
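
In a table format that supports MERGE (Delta Lake or Apache Iceberg), the upsert could look like the sketch below; it assumes a Spark session already configured for such a format, and the table and column names are illustrative:

```python
# Assumes `spark` is a SparkSession configured for a MERGE-capable
# table format (Delta Lake / Apache Iceberg). Names are illustrative.
spark.sql("""
    MERGE INTO payments AS t
    USING payment_updates AS s
      ON t.txn_id = s.txn_id
    WHEN MATCHED THEN UPDATE SET t.status = s.status,
                                 t.updated_at = s.updated_at
    WHEN NOT MATCHED THEN INSERT *
""")
```

Downstream consumers then read one final record per transaction instead of coalescing an append-only history themselves.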

Another example: suppose your company handles sensitive data, say the healthcare data of hospitals. The data is again mostly immutable; most likely only newer entries will be added. But data privacy has to be built into the system from the start. There can be high-profile/VIP customers whose data can only be seen by someone with explicit access. A simple RBAC layer might solve this, but imagine a newly joined intern taking a dump of S3 data for a POC: the most sensitive data is now lying on the laptop where they are running a PySpark notebook. Imagine the potential privacy leak. How do we resolve this? At which layer? While it may seem like a plain data privacy problem, it gets much more complex as the domain changes. What if your company works with governments or implements government contracts?
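
One layer where this can be addressed is masking sensitive columns before anyone without clearance reads them. A minimal PySpark sketch, where the role name, column list, and hashing choice are assumptions (in practice the list would come from a data catalog, not a hardcoded constant):

```python
from pyspark.sql import DataFrame, functions as F

# Hypothetical sensitive columns; a real system would pull these from
# a data catalog / classification service.
SENSITIVE_COLS = ["patient_name", "phone", "national_id"]

def masked_view(df: DataFrame, reader_role: str) -> DataFrame:
    """Return the raw frame only for cleared roles; hash PII otherwise."""
    if reader_role == "privacy_officer":  # assumed cleared role
        return df
    for col in SENSITIVE_COLS:
        if col in df.columns:
            df = df.withColumn(col, F.sha2(F.col(col).cast("string"), 256))
    return df
```
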
How do you, as a data engineer, build a platform where you can't look at prod data even once? You will have to automate and abstract everything, from moving data and creating secrets to obfuscating data in debug logs. You might also have to build a break-glass path where one or two people with the appropriate access can step in during rare incidents. How do you design the data platform now? A wonderful problem to solve.
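
For the log-obfuscation piece, here is a small sketch using Python's standard logging module; the PII patterns are illustrative, not an exhaustive detector:

```python
import logging
import re

# Illustrative PII patterns; a real deployment would use a proper
# classification library, not two regexes.
PII_PATTERNS = [
    (re.compile(r"\b\d{10}\b"), "<PHONE>"),                  # 10-digit numbers
    (re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b"), "<EMAIL>"),
]

class RedactingFilter(logging.Filter):
    """Scrub PII from every record before it reaches a handler."""
    def filter(self, record: logging.LogRecord) -> bool:
        msg = record.getMessage()
        for pattern, token in PII_PATTERNS:
            msg = pattern.sub(token, msg)
        record.msg, record.args = msg, None
        return True

logger = logging.getLogger("pipeline")
logger.addFilter(RedactingFilter())
logger.warning("retry for user 9876543210 at a@b.com")  # logged redacted
```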

The crux of this article is to make folks understand the importance of knowing the domain and the business problem you are solving. As you learn more about what the result is, how it will be used, and what constraints you are bound by, you, as the cliche goes, “get out of the box” of merely better software design principles, better engineering, and test cases, and start making the overall business much better.

Closing remarks: engineers should make the extra effort to understand what happens after their influence over a process ends, and why things are the way they are. From personal experience, I remember solving a problem involving scale where we misjudged the number of records because of a lack of domain knowledge: in the financial world, a single transaction can potentially settle hundreds of thousands of transactions. That word “potentially” is very tricky, because the day it happens it will test, and might break, your design, so it has to be solved for up front. Getting this domain knowledge is one of the hardest things, thanks to a general lack of interest in, or a casual approach to, documentation.


Harjeet Singh

Problem Solver, writes on Tech, finance and Product. Watch out for my new creation, "THE PM SERIES"