Talk details
Row-level lineage at Carbonfact
One of Carbonfact's activities is to produce yearly environmental reports for our customers. These reports get audited by major consulting firms, and the due diligence requires understanding exactly where each data point comes from. This can be tricky, because our customers have many files scattered across their IT landscape. We have developed a row-level data lineage system, in Python, which allows us to quickly answer such requests. It also allows us to compile data quality reports, by indicating how many data points come from primary data sources vs. heuristics and machine learning. We developed a small module in-house because we didn't find anything simple that suited our needs. Now we want to share our learnings! This presentation details the technical architecture, the challenges encountered, and the solutions developed to precisely trace the origin of each data point in complex environmental reports.
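To make the idea concrete, here is a minimal sketch of what row-level lineage could look like in Python. It is not Carbonfact's actual module: the names `Origin`, `Traced`, and `quality_report`, the origin labels ("primary", "heuristic", "ml"), and the sample file and rule names are all hypothetical, chosen only for illustration. Each value carries a record of where it came from, and a data quality report is just a count of origins.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Any, Optional


@dataclass(frozen=True)
class Origin:
    """Where a value came from (hypothetical structure)."""
    kind: str                  # assumed labels: "primary", "heuristic", "ml"
    source: str                # file name, rule name, or model name
    row: Optional[int] = None  # row number in the source file, if any


@dataclass(frozen=True)
class Traced:
    """A value paired with its origin."""
    value: Any
    origin: Origin


def quality_report(rows: list[dict[str, Traced]]) -> dict[str, float]:
    """Share of data points per origin kind, across all rows and fields."""
    counts = Counter(cell.origin.kind for row in rows for cell in row.values())
    total = sum(counts.values())
    return {kind: n / total for kind, n in counts.items()}


# Illustrative data: the second weight was missing and filled by a heuristic.
rows = [
    {
        "material": Traced("cotton", Origin("primary", "products.csv", 2)),
        "weight_kg": Traced(0.20, Origin("primary", "products.csv", 2)),
    },
    {
        "material": Traced("polyester", Origin("primary", "products.csv", 3)),
        "weight_kg": Traced(0.25, Origin("heuristic", "category_average")),
    },
]

report = quality_report(rows)
print(report)

# Answering an audit question: where does this data point come from?
print(rows[1]["weight_kg"].origin)
```

Under these assumptions, an auditor's question about a single figure reduces to reading the `origin` attached to that cell, and the share of primary data falls out of the same records.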

