Data warehousing for online retailers: making sense of your sales data

The e-commerce data fragmentation problem

Online retail generates data at every step of the customer journey. A visitor arrives from a paid ad, browses three products, abandons a basket, receives an email, returns two days later and converts. That sequence touches your ad platform, your storefront, your email tool and your payment processor – and each of those systems records its own version of what happened.

When you try to answer a question as straightforward as "which channel drove the most profitable revenue last month," you're immediately confronted with the fact that each platform has a different answer. Google Ads claims credit. Meta claims credit. Shopify shows a total that includes refunds processed after the month ended. Your payment processor settled a slightly different figure because of currency conversion fees. None of them are wrong, exactly – they're just measuring different things.

This is the fragmentation problem. It's not a question of bad data; it's a question of data that lives in separate systems with no common layer to reconcile it. The result is decisions made on whichever number is most convenient, manual spreadsheet reconciliations that take hours and go stale immediately, and a persistent lack of confidence in the figures being reported to leadership.

What data an online retailer generates

Before you can solve the problem, it helps to map out exactly what you're dealing with. A typical online retailer generates meaningful operational data from at least five distinct system types:

Storefront – orders, product data, customer records, discount usage, basket abandonment and session data from platforms like Shopify, WooCommerce or Magento
Ad platforms – spend, impressions, clicks and attributed revenue from Google Ads, Meta Ads, TikTok Ads and email tools like Klaviyo
Fulfilment and warehouse management – stock levels, despatch times, return rates and carrier performance from your WMS or 3PL
Payment processor – actual settlement amounts, failed payments, refunds and fees from Stripe, PayPal or Klarna
CRM – customer lifecycle data, support history, loyalty programme status and contact preferences

Each of these systems is built to serve its own purpose. None of them is built to talk to the others. That's not a product failing – it's simply not what they're designed to do. Connecting them is your responsibility, and a data warehouse is the tool for the job.

Why platform analytics and Shopify reports aren't enough

Shopify's built-in reporting has improved considerably in recent years. For a business in its early stages, it covers the basics: revenue by product, orders by channel, customer acquisition over time. But it has hard limits that become apparent quickly.

Shopify knows what happens on Shopify. It doesn't know what you spent on Google Ads to acquire the customer who placed that order. It doesn't know whether the product in that order was despatched on time or returned within 30 days. It doesn't know the margin on that product after COGS (cost of goods sold) is factored in. It can tell you the revenue; it can't tell you the profit.

Platform analytics from your ad channels have the inverse problem. They're very good at telling you what you spent and what they attribute to their own channel – but that attribution is almost always self-serving. Meta's attribution window and Google's attribution window overlap, which means both platforms claim credit for the same conversion. Add Klaviyo email attribution on top and you can easily arrive at a situation where your total attributed revenue across platforms is 40% higher than your actual revenue.

The only way to cut through this is to work from a single source of data that you control – one that pulls from all of these systems and applies consistent logic to define what counts as a conversion, what counts as revenue and how credit is allocated.

The case for a retail data warehouse

A data warehouse is a centralised repository that pulls data from all of your operational systems, transforms it into a consistent format and makes it available for reporting and analysis. The key word is centralised. Rather than each team working from their own platform's export, everyone works from the same underlying data set.

For online retailers, the practical benefit is immediate. Finance gets a revenue figure that reconciles with the payment processor's settlement data. Marketing gets spend and attributed revenue calculated on terms they've agreed with the business, not terms imposed by each ad platform. Operations gets stock and despatch data alongside the commercial context – which SKUs are driving returns, which are selling at what velocity, where the margin is going.

The warehouse doesn't replace your operational systems. Shopify is still where your orders live; Google Ads is still where you manage your campaigns. The warehouse sits alongside them, ingesting their data and making it possible to ask questions that cross system boundaries.

Key data sources to connect

Building a retail data warehouse starts with deciding which sources to connect and in what order. The most valuable connections for most online retailers are:

Storefront (Shopify, WooCommerce, Magento) – the commercial core. Orders, line items, customers, products, variants, discounts and refunds. This is usually the first source connected because it anchors everything else.
Ad platforms (Google Ads, Meta Ads, TikTok Ads, Klaviyo) – spend data at campaign and ad-set level, joined to order data to calculate true cost per acquisition and return on ad spend.
Fulfilment and WMS – despatch confirmations, return data and stock levels. Essential for understanding the operational cost of serving demand, not just the revenue it generates.
Payment processor (Stripe, PayPal, Klarna) – settlement data provides the ground truth on what was actually received. Particularly important when you offer buy-now-pay-later options that affect your net revenue figures.
CRM – connects order history to the customer lifecycle, enabling segmentation and retention analysis that your storefront alone can't support.

Tools like Fivetran and Airbyte handle the data ingestion layer – maintaining connectors to each source system and loading data into your warehouse on a defined schedule. This removes the need to build and maintain custom integrations for each platform.

Customer lifetime value and segmentation

Customer lifetime value (CLV) – the total revenue a customer is expected to generate over their relationship with your business – is one of the most commercially important metrics in retail. It drives decisions about how much to spend acquiring a new customer, which segments to invest in retaining and where to focus product development.

Shopify's built-in CLV calculations are simplistic. They look at historical spend per customer and apply a basic projection. That's a starting point, not a model. Proper CLV analysis requires connecting order history across time to a single customer identity, applying cohort analysis to understand how purchasing behaviour changes over a customer's lifecycle and segmenting by acquisition channel to understand which channels are delivering customers who actually return.

None of that is possible when your data lives in separate systems. In a warehouse, customer records from the CRM, order history from Shopify and acquisition source from your ad platforms can all be joined on a common customer identifier. You can then build CLV models that reflect the actual behaviour of your customers – and segment them in ways that are actionable for both marketing and retention strategy.

High-value customer segments, at-risk customers who haven't purchased recently, first-order customers who've never returned – all of these become visible and addressable once you have a unified customer view.

Marketing attribution and spend optimisation

Last-click attribution – the default model in most ad platforms – gives full credit for a conversion to the last channel the customer touched before purchasing. It's simple, it's easy to implement and it's systematically wrong for most retail businesses.

The problem is that last-click consistently overvalues bottom-of-funnel channels (branded search, retargeting) and undervalues upper-funnel channels (awareness campaigns, organic social, email). If you optimise spend based on last-click data, you'll keep cutting upper-funnel investment because it doesn't show direct conversion credit – and then wonder why your acquisition costs start rising as you lose the audiences those channels were building.

A data warehouse enables multi-touch attribution modelling. Instead of each platform reporting its own attributed revenue, you work from a single order data set and apply your own attribution logic – linear, time-decay, position-based or a custom model – across the actual touchpoints in the customer journey. The result is a much more accurate picture of what each channel is genuinely contributing and a basis for spend allocation decisions that aren't distorted by platform self-interest.

The blended view also makes it possible to calculate true return on ad spend (ROAS) net of COGS. Combining your product cost data with marketing spend and revenue by channel tells you which channels are generating profitable customers, not just customers – a distinction that matters considerably once you're at scale.

Stock and fulfilment reporting

Revenue and marketing data get most of the attention in retail analytics, but stock and fulfilment data often has more immediate operational impact. Knowing what's selling, at what velocity, with what return rate and at what current stock level is the information that drives purchasing decisions – and getting it wrong is expensive in both directions.

Overstock ties up working capital and typically ends in markdowns. Understock means missed revenue and, increasingly, customer acquisition costs that go to waste when the product the customer wanted isn't available. Neither failure shows up clearly in your storefront's reporting because that reporting doesn't connect to your warehouse system.

A unified data layer connects SKU-level sales velocity from your storefront to current stock levels from your WMS and return rate data from your fulfilment partner. You can then surface the SKUs that are selling quickly and approaching reorder thresholds, the SKUs that have high return rates suggesting a product quality or description issue and the products generating the strongest margin after fulfilment costs are factored in.

For retailers with wholesale as well as direct-to-consumer channels, the picture is even more fragmented without a warehouse – stock available across both channels needs to be visible in one place to avoid overselling or allocation errors.

Route B builds data warehouses for online retailers. Get in touch to discuss your reporting requirements.

Get in Touch