Skip to content
bizurk
← ALL WRITING

2026-05-25 / 16 MIN READ

SQL cohort analysis on Shopify raw data, no GA4 needed

SQL cohort analysis Shopify raw data. CTE patterns, monthly cohort buckets, retention curves at month 1, 3, 6, and 12 you can paste into BigQuery Monday.

A DTC operator asked me on a Monday morning whether the brand's August cohort had stalled. The marketing team had been saying yes for three weeks. The CFO had been saying no. Both were reading GA4. The answer lived in Shopify orders the entire time, and the SQL to find it was about thirty lines long.

This is the cohort retention pattern I copy into every warehouse-first engagement. Monthly cohort buckets, retention curves at month 1, 3, 6, and 12, all computed from Shopify raw orders. The math runs against the orders that actually shipped, grouped by the month they first happened, measured at the horizons that matter for an ad team and a CFO.

It fits underneath the warehouse-first reporting layer at the mart-modeling stage. The SQL below runs against the same staging schema described in BigQuery for Shopify data.

Why cohort retention belongs in SQL, not GA4

GA4 has a cohort exploration tab. I have watched operators stare at it for ten minutes and walk away with a number they could not defend.

Start with what GA4 counts. The cohort exploration uses session-based cohorts by default. A customer who placed an order, came back six weeks later from a different browser without logging in, and placed another order often appears as two distinct cohort members instead of one returning member. The cohort size is wrong on day one and the retention curve compounds the error.

Then there is the time window. GA4 caps standard property data retention at fourteen months. Anything older gets aggregated, sampled, or dropped depending on the report. For a brand that wants twelve-month retention by acquisition month over three years, that ceiling means the question cannot be asked.

Consent mode v2 is the third hole. Events that arrive without consent are modeled by Google's algorithm rather than counted. Modeled numbers are fine for trend lines and useless for cohort math, because the model does not know which specific customer placed the order. A cohort is a set of customers, not a set of estimated users.

Shopify orders do not have any of these problems. Every paid order is a row, and every row has a customer_id, a created_at, and a total_price. A cohort is whatever group of customer_ids you decide to bucket together, and the retention curve is whatever percentage of that bucket placed another order by horizon X. The math is closed-form and the data is yours. The only thing between an operator and a defensible answer is thirty lines of SQL.

Single translucent glass shard isolated on a dark studio backdrop, sharp jagged edge under cold electric-blue rim, faint hot-pink dispersion bleeding across the broken face.
// the shard · jagged edge under twin rim lights

The base CTE for SQL cohort analysis on Shopify raw data

Every cohort retention query I write starts with the same two CTEs. The first computes each customer's first paid order. The second produces one row per (customer, order) with the cohort metadata attached.

WITH customer_first_order AS (
  SELECT
    customer_id,
    MIN(created_at) AS first_order_at,
    DATE_TRUNC(DATE(MIN(created_at)), MONTH) AS cohort_month
  FROM {{ ref('stg_shopify__orders') }}
  WHERE financial_status = 'paid'
    AND customer_id IS NOT NULL
  GROUP BY customer_id
),
customer_orders AS (
  SELECT
    c.customer_id,
    c.cohort_month,
    o.order_id,
    o.created_at AS order_at,
    DATE_DIFF(DATE(o.created_at), DATE(c.first_order_at), MONTH)
      AS months_since_first
  FROM customer_first_order c
  JOIN {{ ref('stg_shopify__orders') }} o
    ON c.customer_id = o.customer_id
    AND o.financial_status = 'paid'
)
SELECT * FROM customer_orders

A few choices in here are not obvious. MIN(created_at) gets the customer's first paid order, which is the only definition of a cohort that survives a replatform. Draft and abandoned orders do not enter the calculation because financial_status = 'paid' filters them out. DATE_TRUNC to month is the right default for any DTC brand doing $2-10M annually; weekly cohorts get noisy below 200 customers per cohort, and monthly is what an operator can hold in their head.

The months_since_first column is the workhorse. It expresses every order as a horizon offset from the customer's first order. Month 0 is the acquisition month itself, month 1 is the next calendar month, month 12 is the twelve-month return horizon. Every retention query below is a COUNTIF against this column.

The shape that comes out of the CTE is one row per (customer, order). For a brand with 50,000 customers and an average of 1.6 orders per customer, that is 80,000 rows. Cheap to query, cheap to materialize.

Monthly cohort buckets, the default I always start with

The first thing the operator asks once the CTE works is how big each cohort actually is. The answer is a one-line query.

SELECT
  cohort_month,
  COUNT(DISTINCT customer_id) AS cohort_size
FROM customer_orders
GROUP BY cohort_month
ORDER BY cohort_month;

What I look at first is the cohort size trend. Cohorts that grew steadily for a year and then dropped 40 percent in October usually mean the brand changed its acquisition channel mix and nobody updated the LTV assumptions in the ad team's spreadsheet. The retention curve will compound that change, so naming it before reading the retention numbers saves a misdiagnosis later.

For most $2-10M Shopify brands, the cohort size column has 24-36 useful months of history and 200-2,000 customers per cohort. Below 200 the retention math gets noisy and I either widen the bucket to a quarter or filter the cohort out of dashboards.

COUNT(DISTINCT customer_id) is the right aggregation here. A naive COUNT(*) will count every order, which is twice the cohort size for any brand with non-trivial repeat behavior. I have seen quarterly board decks ship with cohort sizes that were 2.3x the actual customer count because someone forgot the DISTINCT.

Ultra-wide distant view of a solitary translucent monolith on a dim plain at dusk, vanishing horizon, electric-blue atmospheric haze, single hot-pink rim on the far edge.
// the distant form · monolith against vanishing horizon

Retention curves at month 1, 3, 6, and 12

This is the query that operators actually paste into BigQuery on Monday morning. It produces one row per cohort_month, with cohort size, and the percent of customers who placed at least one additional paid order by month 1, 3, 6, and 12.

WITH cohort_returns AS (
  SELECT
    cohort_month,
    customer_id,
    MAX(CASE WHEN months_since_first BETWEEN 1 AND 1 THEN 1 ELSE 0 END)
      AS returned_by_m1,
    MAX(CASE WHEN months_since_first BETWEEN 1 AND 3 THEN 1 ELSE 0 END)
      AS returned_by_m3,
    MAX(CASE WHEN months_since_first BETWEEN 1 AND 6 THEN 1 ELSE 0 END)
      AS returned_by_m6,
    MAX(CASE WHEN months_since_first BETWEEN 1 AND 12 THEN 1 ELSE 0 END)
      AS returned_by_m12
  FROM customer_orders
  GROUP BY cohort_month, customer_id
)
SELECT
  cohort_month,
  COUNT(*) AS cohort_size,
  ROUND(100.0 * SUM(returned_by_m1) / COUNT(*), 1) AS retention_m1_pct,
  ROUND(100.0 * SUM(returned_by_m3) / COUNT(*), 1) AS retention_m3_pct,
  ROUND(100.0 * SUM(returned_by_m6) / COUNT(*), 1) AS retention_m6_pct,
  ROUND(100.0 * SUM(returned_by_m12) / COUNT(*), 1) AS retention_m12_pct
FROM cohort_returns
WHERE cohort_month >= DATE_SUB(CURRENT_DATE(), INTERVAL 24 MONTH)
  AND cohort_month <= DATE_SUB(CURRENT_DATE(), INTERVAL 1 MONTH)
GROUP BY cohort_month
ORDER BY cohort_month;

The BETWEEN 1 AND N pattern is what makes this readable. A customer "returned by month 6" if any of their orders fell between month 1 and month 6 since acquisition. The MAX(CASE ...) collapses multiple repeat orders to a single 1-or-0 per customer, which is what retention math needs.

The lower bound of cohort_month >= 24 months ago is what most operators want for a Monday-standup view. The upper bound of cohort_month <= 1 month ago is the part most teams forget. Including the current calendar month in retention dashboards is misleading because that cohort has had less than a month to repeat. Cutting the youngest cohort off the dashboard is what makes the numbers comparable.

A typical mid-market DTC brand will produce something like this:

| cohort_month | size | m1 | m3 | m6 | m12 | |---|---|---|---|---|---| | 2024-08 | 1,420 | 12.3% | 24.1% | 31.7% | 42.4% | | 2024-09 | 1,510 | 13.1% | 25.4% | 33.0% | 43.8% | | 2024-10 | 1,680 | 11.7% | 23.0% | 30.5% | 41.2% | | 2024-11 | 2,140 | 9.8% | 19.5% | 26.4% | 36.9% |

The November cohort is the operator's question. Why did m1 drop 350 basis points from September? Two answers usually apply: November traffic was BFCM-driven (lower-intent first orders) or the brand ran a promo that pulled forward demand without converting it to repeat behavior. Either way, the SQL flags it. The interpretation is the conversation.

The retention curve is the only retention number worth reporting up. Single-number retention averages everything together and hides the cohort that is actually dying.

The Monday morning query operators actually run

Most operators do not want to read a 24-row table every Monday. They want one alert and one comparison. The wrapper query I usually add on top of the retention curve looks like this.

WITH retention AS (
  -- the m1/m3/m6/m12 retention query from the previous section
),
benchmark AS (
  SELECT
    AVG(retention_m3_pct) AS avg_m3_last_12mo
  FROM retention
  WHERE cohort_month >= DATE_SUB(CURRENT_DATE(), INTERVAL 13 MONTH)
    AND cohort_month <= DATE_SUB(CURRENT_DATE(), INTERVAL 4 MONTH)
)
SELECT
  r.cohort_month,
  r.cohort_size,
  r.retention_m3_pct,
  b.avg_m3_last_12mo,
  ROUND(r.retention_m3_pct - b.avg_m3_last_12mo, 1) AS delta_vs_benchmark
FROM retention r
CROSS JOIN benchmark b
WHERE r.cohort_month >= DATE_SUB(CURRENT_DATE(), INTERVAL 6 MONTH)
ORDER BY r.cohort_month DESC;

The output is six rows, each with a delta_vs_benchmark column. Anything more than 2 percentage points below the trailing average is worth investigating. Anything more than 4 points below is the conversation that needs to happen before the marketing team finalizes next month's spend. Within 1 point of the trailing average is noise.

I have shipped this exact pattern at multiple DTC brands. The shape of the output never changes. What changes is the conversation it produces in the room.

Macro close-up of crystalline dispersion on a single glass slab, refractive bands of cold blue splitting into hot-pink across the chipped corner, fine surface detail.
// the dispersion · refractive bands close up

Turning the query into a dbt mart model

The Monday morning query is not the right thing to run live every time the dashboard refreshes. It scans the orders table, computes per-customer aggregations, and joins twice. At a $5M brand with three years of data, that is roughly 200,000 customer-order rows, which BigQuery will scan in a few seconds and bill at maybe a quarter of a cent. At a $50M brand with eight years of data, the same query is a much bigger scan, and dashboard refresh times become noticeable.

The right answer is a dbt mart model that materializes once per day and partitions by cohort_month. Dashboards then read the mart directly.

-- models/marts/mart_cohort_retention.sql
{{
  config(
    materialized='table',
    partition_by={'field': 'cohort_month', 'data_type': 'date'},
    cluster_by=['cohort_month']
  )
}}

WITH customer_first_order AS ( /* base CTE */ ),
customer_orders AS ( /* base CTE */ ),
cohort_returns AS ( /* per-customer 1-or-0 horizons */ )

SELECT
  cohort_month,
  COUNT(*) AS cohort_size,
  ROUND(100.0 * SUM(returned_by_m1) / COUNT(*), 1) AS retention_m1_pct,
  ROUND(100.0 * SUM(returned_by_m3) / COUNT(*), 1) AS retention_m3_pct,
  ROUND(100.0 * SUM(returned_by_m6) / COUNT(*), 1) AS retention_m6_pct,
  ROUND(100.0 * SUM(returned_by_m12) / COUNT(*), 1) AS retention_m12_pct,
  CURRENT_TIMESTAMP() AS _refreshed_at
FROM cohort_returns
GROUP BY cohort_month;

Daily refresh is enough. Cohort retention does not change minute to minute, and overnight is the cadence that matches how operators actually use the data. The partition on cohort_month makes any "show me the last 12 cohorts" query scan only 12 partitions instead of the whole table.

The mart is roughly 36 rows for a three-year-old brand. Dashboards read it instantly. The cost to refresh once per day is a few cents. The same shape underwrites the cohort LTV mart which lives one model over and follows the same partitioning rule.

Ultra-wide backlit silhouette of a single tall translucent slab against a deep electric-blue evening sky, hot-pink magic-hour band along the low horizon.
// the silhouette · slab against evening sky

Where the SQL breaks first

Four things go wrong in the first month after this pattern ships, and all of them have known fixes.

Customer_id mismatches. Shopify's customer_id is its own identifier. If the brand also has Klaviyo, Recharge, or a custom identity layer, those systems carry their own customer keys. A cohort defined on Shopify's id is a Shopify cohort, not a brand cohort. For retention reporting that is usually fine. For attribution that crosses platforms, identity reconciliation has to happen at the staging layer first.

Refunds counted as orders. A customer who placed an order, kept it, then placed a refunded order will count as a "repeat" in this SQL because both orders have financial_status = 'paid' at the moment of the order. The fix is to filter on the post-refund state at query time, or to include a total_refunded field in the staging model and exclude orders where total_refunded >= total_price. I usually include the field upstream so every retention query inherits the fix.

Test orders polluting the earliest cohort. When a brand first installs Shopify, the team places test orders to verify checkout. Those test orders have real customer_id values for the staff accounts. The earliest cohort is therefore inflated and shows artificially high retention because the same staff customers keep placing test orders. The fix is a WHERE clause that excludes known internal email domains, or a tag-based filter ('internal' not in tags).

Subscription rebills as net-new repeats. If the brand uses Recharge or Bold or any subscription stack, the rebill orders show up in the orders table as additional paid orders by the same customer. The customer "returned" at month 1, 2, 3, 4. Every month. That is not retention in the operator sense; it is autopilot. For brands with a meaningful subscription mix, I either build a separate mart_subscription_cohort_retention model on subscription_contract data, or I add a is_subscription_rebill flag at the staging layer and filter it out of the discrete-purchase retention math.

FAQ

Do I need a warehouse to run this, or can I do it directly in Shopify reports?

You need a warehouse. Shopify's built-in cohort report is session-based and does not let you write the SQL above. The cheapest viable setup is BigQuery on-demand pricing with a daily Shopify export; that runs about $20-50 per month for a $5M brand. The full economics are in the warehouse setup post for budget DTC stores.

Why monthly cohorts and not weekly?

At $2-10M annual revenue, weekly cohorts run 50-200 customers each, which is below the threshold where retention math is statistically stable. Monthly cohorts at the same revenue band run 200-2,000 customers, which is enough signal. Weekly makes sense at $20M+ when each week has 1,000+ first-time customers.

Should I include or exclude promo-driven cohorts?

Include them, but tag them. Add a cohort_acquisition_promo column at the staging layer that flags whether the customer's first order used a promo code. The retention curve for promo-acquired cohorts is usually 30-50 percent lower than full-price-acquired cohorts. Reporting them blended hides the gap and the ad team will spend more on promo-driven acquisition than the math justifies.

What about retention by acquisition channel?

Same shape, with acquisition_channel added to the GROUP BY. The query gets wider but the logic does not change. Channel-level retention is where the conversation about ad spend mix actually happens; pair it with the attribution modeling SQL for the full picture.

How do I handle a customer who has multiple emails or accounts?

Identity reconciliation upstream of the cohort model. Most brands solve this with email-hash matching at the staging layer, which collapses multiple Shopify customer rows that share an email into a single canonical customer. Until that staging model exists, every retention number has a known but uncharacterized error bar.

Can I run this against a Postgres or Snowflake warehouse instead of BigQuery?

Yes. The SQL above is mostly portable. Postgres uses DATE_TRUNC('month', ...) instead of BigQuery's DATE_TRUNC(date, MONTH). Snowflake uses DATEADD instead of DATE_SUB. Snowflake also uses COUNT_IF natively, which is slightly cleaner than the MAX(CASE WHEN ...) pattern. The shape of the answer is identical.

What to try this week

Run the base CTE against your warehouse. Confirm the cohort sizes match what your finance team has for monthly first-time-customer counts. Then run the m1/m3/m6/m12 retention query and look at the last six rows. If the trailing-month delta column shows anything more than 2 points below the 12-month average, that cohort is the question worth bringing to next Monday's standup.

If the SQL breaks on customer_id mismatches or refunds, the deeper blocker is the staging layer. The DTC stack audit scopes the staging-layer rebuild that makes cohort retention numbers actually defendable. I've shipped the same pattern as part of a multi-service data pipeline in Q1 2026 for a mid-market DTC operator.

Sources and specifics

  • The base CTE pattern runs against a standard dbt-shopify staging model; see BigQuery for Shopify data for the schema fields the queries above expect.
  • Retention rate ranges (m1 around 10-15 percent, m12 around 35-45 percent) are typical for mid-market DTC Shopify brands; subscription and premium brands trend higher, impulse-commodity brands trend lower.
  • The four common breakage modes (customer_id mismatch, refunds, test orders, subscription rebills) were observed across multiple DTC engagements; fix patterns live at the staging layer.
  • BigQuery on-demand pricing of $6.25 per TB scanned is current as of April 2026; the mart-model pattern keeps a typical brand's cohort retention queries under one cent per dashboard refresh.
  • A Q1 2026 analytics engine rebuild for a mid-market DTC operator runs the multi-service pipeline (separate ingestion, warehouse, API, and UI layers) where this cohort retention pattern is the core mart model.

// related

Let us talk

If something in here connected, feel free to reach out. No pitch deck, no intake form. Just a direct conversation.

>Get in touch

Tell me what you’re trying to ship.

Send a quick message and I read it within a day, or talk to AI Michael first if you want to feel out your project before you write to me.