The balancing act: Data Management and Data Consumption

Data Architects have a tougher challenge now than ever before.

In the early 2000s, data warehousing gained traction as businesses sought to consolidate data from various sources to support decision-making. Data architecture split into two schools of thought: the ‘Normalized Data Warehouse’, which prioritized data integrity and flexibility, and the ‘Denormalized Data Marts’, which prioritized consumption.

Because analytics systems were batch-oriented and largely read-only, building for consumption was the more efficient model and became the preferred approach. As data sources and data needs have evolved, that has changed. With frequent data ingestion and processing, balancing ‘Data Management’ against ‘Data Consumption’ has become a challenge, and understanding its complexities is crucial for data architects and engineers navigating this intricate landscape.

Data Access Scenarios and Consumption Patterns

Taking the retail industry as an example, transactional data like orders, invoices, and events often requires analysis across geographies, products, and markets. Data is generally rolled up and reported on. When the business spots areas of concern, it needs the ability to drill down and slice and dice the data by different dimensions.

These dimensions can be denormalized into a Star Schema with minimal risk due to their infrequent changes. Since dimensions are relatively small, they can be joined with transactional data on the fly during consumption. While OLAP cubes can support executive reporting, slicing, dicing and joining the data can also be performed dynamically at the point of consumption.
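As a concrete sketch of joining on the fly, the snippet below uses pandas with a hypothetical denormalized product dimension; the column names and values are illustrative only, not the actual schema.

```python
import pandas as pd

# Denormalized product dimension: hierarchy levels flattened into one row.
product_dim = pd.DataFrame({
    "product_id": [1, 10],
    "category": ["Hand Tools", "Power Tools"],
    "department": ["Tools", "Tools"],
})

# Transactional sales fact kept at its natural grain.
sales_fact = pd.DataFrame({
    "date": ["10-11-24", "11-11-24"],
    "product_id": [1, 10],
    "invoice_quantity": [2, 3],
})

# Join the small dimension at consumption time, then roll up or drill down
# by any level of the flattened hierarchy.
rollup = (sales_fact.merge(product_dim, on="product_id", how="left")
                    .groupby(["department", "category"], as_index=False)
                    ["invoice_quantity"].sum())
print(rollup)
```

Because the dimension is small and changes infrequently, paying this join cost at consumption time is usually cheaper than maintaining pre-aggregated cubes for every drill path.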

A data warehouse brings data from multiple domains together to support enterprise metrics such as inventory turnover, which compares inventory against sales. Such metrics, crucial at the executive level, cannot be computed at runtime because blending the data on the fly is challenging: inventory systems provide end-of-day snapshots, while sales systems publish events in real time, making immediate integration impractical.

In this case, a solution is a unified fact table in which transactions share common attributes such as date/time, customer, vendor, etc. A consistent structure allows the data to be combined via efficient Union All operations, which process the datasets in parallel and avoid complex joins.

  1. Event_Sales_Fact

| Date/Time of Sale | Customer Id | Product Id | Location Id | Metric Name | Metric Value |
|---|---|---|---|---|---|
| 10-11-24 | 123 | 1 | 500 | Invoice Quantity | 2 |
| 10-11-24 | 123 | 1 | 500 | Invoice Amount | $200 |
| 11-11-24 | 125 | 10 | 550 | Invoice Quantity | 3 |
| 11-11-24 | 125 | 10 | 550 | Invoice Amount | $150 |

  2. Event_Inventory_Fact

| Date | Customer Id | Product Id | Location Id | Metric Name | Metric Value |
|---|---|---|---|---|---|
| 10-11-24 | NA | 1 | 500 | Inventory Units | 10 |
| 10-11-24 | NA | 10 | 550 | Inventory Units | 7 |
| 11-11-24 | NA | 1 | 500 | Inventory Units | 8 |
| 11-11-24 | NA | 10 | 550 | Inventory Units | 4 |

To get inventory by sales for a given date, we can Union All the two facts and compute sum(Inventory Units) / sum(Invoice Quantity). This way the two datasets are blended, and performance remains optimal.
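Below is a minimal sketch of this blending in pandas, assuming the two fact tables above have been loaded as DataFrames; the column names mirror the example tables and are otherwise hypothetical.

```python
import pandas as pd

# Sales and inventory facts share the same column layout (see tables above).
sales_fact = pd.DataFrame({
    "date": ["10-11-24", "10-11-24", "11-11-24", "11-11-24"],
    "customer_id": ["123", "123", "125", "125"],
    "product_id": [1, 1, 10, 10],
    "location_id": [500, 500, 550, 550],
    "metric_name": ["Invoice Quantity", "Invoice Amount",
                    "Invoice Quantity", "Invoice Amount"],
    "metric_value": [2, 200, 3, 150],
})
inventory_fact = pd.DataFrame({
    "date": ["10-11-24", "10-11-24", "11-11-24", "11-11-24"],
    "customer_id": ["NA", "NA", "NA", "NA"],
    "product_id": [1, 10, 1, 10],
    "location_id": [500, 550, 500, 550],
    "metric_name": ["Inventory Units"] * 4,
    "metric_value": [10, 7, 8, 4],
})

# Union All: stack the facts; no join is needed because the structures match.
events = pd.concat([sales_fact, inventory_fact], ignore_index=True)

# Sum each metric per date, then divide inventory units by invoice quantity.
per_date = events.pivot_table(index="date", columns="metric_name",
                              values="metric_value", aggfunc="sum")
per_date["inventory_per_unit_sold"] = (
    per_date["Inventory Units"] / per_date["Invoice Quantity"]
)
print(per_date)
```

The same pattern maps directly to a SQL UNION ALL followed by a conditional aggregation on the metric name.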

While most dimensions are common, some don’t apply to all events: Sales or Browse events have a customer, but inventory doesn’t. To keep the representation uniform, we can always add a dummy dimension linkage, which preserves the common structure without hurting data integrity.
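A minimal pandas sketch of that dummy linkage, using a hypothetical customer dimension, is shown below: adding an explicit ‘NA’ member means every fact row, sales or inventory, resolves to exactly one dimension row.

```python
import pandas as pd

customer_dim = pd.DataFrame({
    "customer_id": ["123", "125"],
    "customer_name": ["Acme Builders", "Oak Street Hardware"],
})

# Dummy member for events, such as inventory snapshots, that have no customer.
na_member = pd.DataFrame({"customer_id": ["NA"],
                          "customer_name": ["Not Applicable"]})
customer_dim = pd.concat([customer_dim, na_member], ignore_index=True)

# Sales and inventory rows alike now join cleanly to the customer dimension.
```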

Is there a storage overhead with this approach? Normally the metrics would be columns in a fact table; here a row is repeated for each metric, so dimension values are duplicated and raw storage grows. Modern data platforms, however, offer data compression, and repeated dimension values compress very efficiently. With storage cheap relative to compute, and compression shrinking the footprint further, the overhead is usually an acceptable trade-off for avoiding joins at consumption time.

Event Lifecycles

In retail, there is growing demand to analyze the impact of events across the business lifecycle through cross-domain analysis. For instance, a delayed vendor product affects replenishment and sales, while a port strike could impact logistics and multiple other business processes. To assess these impacts, businesses need to blend data across transactional systems, often in real time.

Can we blend this data during data management processing? Not easily: data from different systems arrives with different delays and at different times, so blending it during processing would delay visibility. Because real-time visibility is crucial, cross-domain dependencies typically cannot be added at this stage.

But stakeholders still need the data linked together and delivered for analytics. For these cases, newer technology options like graph data structures can help. Just as applications have evolved with GraphQL, data structures are evolving with graph databases. Representing the data as a graph preserves the linkages, so when an object changes we can traverse to the related objects and assess the impact, enabling detailed business planning.

The architecture of graph databases allows nodes to be indexed and traversed efficiently. They are not ideal for aggregate analytics, but when the analysis is at the event level, graph traversal is far more efficient.
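As a rough illustration, the snippet below uses the open-source networkx library as a stand-in for a graph database; the event names and linkages are hypothetical.

```python
import networkx as nx

# Each node is a business object or event; each edge is a cross-domain linkage
# recorded as the events arrive, with no blending at ingestion time.
g = nx.DiGraph()
g.add_edge("PO-1001 (vendor delay)", "Shipment-77")
g.add_edge("Shipment-77", "DC-Replenishment-5")
g.add_edge("DC-Replenishment-5", "Store-450-Inventory")
g.add_edge("Store-450-Inventory", "Lost-Sales-Forecast")

# Event-level analysis: starting from one changed object, traverse the graph
# to find every downstream object it could impact.
impacted = nx.descendants(g, "PO-1001 (vendor delay)")
print(impacted)
```

A graph database would express the same traversal declaratively in its query language, but the shape of the analysis is the same.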

Finding the Sweet Spot

Data architects have a tougher challenge now than ever before. The pressure on both data consumption and data management has increased tremendously, and the amount of data generated across the industry has grown to zettabytes. Future-proofing solutions is all the trickier when the landscape evolves at such a breathtaking speed.

Striking the right balance between data management and data consumption is paramount. While there is no one-size-fits-all solution, some general guidelines emerge:

  • Denormalize dimensions. Dimensions change slowly while consumption by hierarchy is very frequent, so the denormalization pays off.
  • Look at data architectures where metrics across domains can be blended without joins. A simple, uniform fact structure works in most cases.
  • Think about data as a graph, not just as objects, so the cross-domain impacts of events can be handled better.

These simple steps allow data management to be resilient and performant. By carefully evaluating their specific requirements and consumption patterns, organizations can navigate this intricate landscape and unlock the full potential of their data assets.

Devanathan Rajagopalan
Devanathan is the Principal Data Engineer of Data Analytics and Insights, Platforms at Lowe’s. As a part of the technology leadership team at Lowe’s, Deva is responsible for the overall Data Platform Architecture, Strategy and building and guiding the Engineering frameworks on Data Engineering, ML and Analytics.