Reconciliation at Scale — Part 2

Harjeet Singh
6 min read · Sep 16, 2023
Image source: Google

I started the previous article on reconciliation with a poem. If you haven’t read Part 1, or want another look at the poem, please find it here.

“Data, Data, everywhere,
Not a single record should be missed;
Data, Data, everywhere,
All the stakeholders will be pissed.”
- Harjeet Singh

As the image and the poem convey, we can’t compare apples and oranges and say everything’s alright, and for most engineers, missing data points is simply not acceptable. In the previous article, we discussed why reconciliation is essential and how we can perform it at scale. Let’s continue from there.

We were left with a Spark dataset/dataframe, diff_union, that contains two rows for each unique id whose values might differ on some columns. At this stage, we cannot say whether the old data source (A) or the new data source (B) is bad. We need a source of truth (the Master), because the differences can occur for a variety of reasons (as discussed in the previous article):

-> Representation of Null or empty values.
-> Schema changes in the old and new data sources (which can be due to business requirements).
-> The new data source runs at a much lower latency, so at any given point in time t it has more data than the old data source, which leads to differences.
-> Arbitrary change or corruption in data, which needs…
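
To make the first reason concrete, here is a minimal plain-Python sketch (not Spark code, and not necessarily how the original pipeline does it) of how the two rows per id in diff_union could be compared once null-like values are normalized. The `source` tag column, the `NULL_LIKE` set, and the sample column names `name` and `amount` are assumptions made only for illustration.

```python
from collections import defaultdict

# Hypothetical set of values treated as "no value"; adjust per data source.
NULL_LIKE = {None, "", "null", "NULL"}


def normalize(value):
    """Map every null-like representation to None so they compare equal."""
    if isinstance(value, str):
        value = value.strip()
    return None if value in NULL_LIKE else value


def differing_columns(diff_union, key="id", source="source"):
    """Return {id: sorted list of columns that differ between A and B}."""
    # Group the rows by id, keeping one row per source tag ("A" or "B").
    by_id = defaultdict(dict)
    for row in diff_union:
        by_id[row[key]][row[source]] = row

    result = {}
    for record_id, pair in by_id.items():
        a, b = pair.get("A"), pair.get("B")
        if a is None or b is None:
            # The id exists in only one source (for example, a latency gap).
            continue
        result[record_id] = sorted(
            col
            for col in a
            if col not in (key, source)
            and normalize(a[col]) != normalize(b.get(col))
        )
    return result


# Assumed sample data: id 1 differs only in how "empty" is represented,
# id 2 has a real difference in the amount column.
diff_union = [
    {"id": 1, "source": "A", "name": "", "amount": 10},
    {"id": 1, "source": "B", "name": None, "amount": 10},
    {"id": 2, "source": "A", "name": "x", "amount": 5},
    {"id": 2, "source": "B", "name": "x", "amount": 7},
]

report = differing_columns(diff_union)
```

In this sketch, an id whose list comes back empty differs only in its null representation, while a non-empty list names the columns that still need to be checked against the Master.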

