Guides · Conversions · Historical backfill

Historical backfill

Replay past orders — and, optionally, past sessions with their UTMs — so reports start with full revenue context from day one. Events keep their original timestamps; first-time-buyer flags stay consistent across the replay.

01 · How backfill works

Two endpoints cover the whole import — one for sessions, one for events. Both are idempotent.

Sumidata's ingest API is the same surface the browser SDK uses — there is no separate bulk-import pipeline. Backfill means replaying historical rows through two endpoints with caller-supplied UUIDs:

  • POST /sdk/sessions/import (optional). Reconstructs a historical session with its UTM snapshot, so conversions that reference it inherit the right attribution.
  • POST /sdk/ingest with source: 'import' — the same endpoint the SDK uses, with relaxed validation that makes sessionId and replayId optional.
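Putting the two calls together, one session import and one ingest envelope sharing the same caller-supplied sessionId, can be sketched as a payload builder. This is a sketch: field names follow the examples in this guide, and the order shape stands in for your own source-of-truth row.

```typescript
import { randomUUID } from 'crypto'

// Sketch: build the two request bodies for one replayed order.
// `order` is a stand-in for your own orders row (an assumption).
function buildReplayPayloads(order: {
  id: string
  total: number
  paidAt: Date
  sessionStartedAt: Date
  utmSource?: string
}) {
  const sessionId = randomUUID()

  // Body for POST /sdk/sessions/import: reconstructs the historical session.
  const session = {
    projectId: process.env.SUMIDATA_PROJECT_ID,
    sessionId,
    deviceId: '00000000-0000-0000-0000-000000000000',
    startedAt: order.sessionStartedAt.getTime(),
    utmSource: order.utmSource,
  }

  // Body for POST /sdk/ingest: re-uses the same sessionId so the
  // conversion links to the imported session and inherits its UTMs.
  const ingest = {
    projectId: process.env.SUMIDATA_PROJECT_ID,
    deviceId: '00000000-0000-0000-0000-000000000000',
    sessionId,
    source: 'import' as const,
    events: [{
      name: 'purchase',
      orderId: order.id,
      totalAmount: order.total,
      timestamp: order.paidAt.getTime(), // original event time, in ms
    }],
  }

  return { session, ingest }
}
```

POST each payload in that order (session first, then ingest) so attribution is in place before the conversion lands.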
Note: Dedup behaviour is load-bearing for safe replays. Ingest checks dedup_keys before every conversion — a repeat of (orderId, productId) returns 400 with Duplicate conversion: … and rejects the whole batch. This makes retries safe, but it also means you cannot update an already-ingested conversion by re-sending it. See Updating a backfilled order.
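Because a dedup hit is a deterministic 400, a retry wrapper can treat it as "already done" rather than a failure. A sketch, assuming an axios-style error shape (response.status, response.data.message); adapt the check to whatever your HTTP client throws:

```typescript
// Sketch: treat the documented "Duplicate conversion" 400 as already-done,
// so re-running a range is a no-op. `post` is any HTTP helper that throws
// on non-2xx with an axios-style error object (an assumption).
async function ingestOnce(
  post: (body: unknown) => Promise<unknown>,
  body: unknown,
): Promise<'sent' | 'already-ingested'> {
  try {
    await post(body)
    return 'sent'
  } catch (e: any) {
    const msg: string = e?.response?.data?.message ?? String(e)
    if (e?.response?.status === 400 && msg.includes('Duplicate conversion')) {
      return 'already-ingested' // dedup hit; retries stay safe
    }
    throw e // real failure; let the caller requeue
  }
}
```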

02 · Timestamps

Always send timestamp — never let ingest-time stand in for event-time.

When you replay historical orders, the time the request reaches Sumidata is not the time the sale happened. If you omit timestamp, a year of orders collapses into today and every funnel, cohort, and retention chart breaks.

Send timestamp as Unix milliseconds matching when the business event actually occurred — orders.paid_at, invoice.settled_at, whatever your system of record stores. No min/max is enforced by the server; timestamps decades in the past or minutes in the future are both accepted, but reports are capped by project retention (10 years for events, 1 year for dedup_keys).

timestamp.ts
const event = {
  name: 'purchase',
  orderId: order.id,
  totalAmount: order.total,
  timestamp: order.paidAt.getTime(),  // original event time, in ms
}
Warning: dedup_keys has a 1-year TTL. A conversion whose createdAt falls outside that window will not collide with a replay of the same (orderId, productId). For the common case of a one-shot first-integration backfill this doesn't matter; for periodic reconciliation runs, do not rely on dedup to block duplicates older than a year.
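One way to stay safe across that window is to track what you've already sent on your side and filter locally before posting. A sketch, using a hypothetical sumidataSentAt column you'd maintain alongside your orders table:

```typescript
// Sketch: skip rows your own system already marked as sent, instead of
// relying on the server's 1-year dedup TTL. `sumidataSentAt` is a
// hypothetical bookkeeping column, not part of the Sumidata API.
function rowsToReplay<T extends { sumidataSentAt?: Date | null }>(rows: T[]): T[] {
  return rows.filter((r) => r.sumidataSentAt == null)
}
```

After each successful ingest, stamp the row; reconciliation runs then post only rows that were never sent, regardless of how old they are.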

03 · Importing historical sessions

Reconstruct the session the order happened in, so UTMs and landing-page attribution attach correctly.

Conversions inherit UTM attribution by joining sessionId or falling back to device/user attribution tables. If you backfill only events with no prior session, the conversions land without attribution — the AI Analyst can still count revenue, but campaign reports will be empty for the backfilled period.

To keep attribution honest, import the sessions first. Each call creates the session row and a corresponding device_attribution entry if any UTM parameter is present.

POST https://api.sumidata.io/sdk/sessions/import

Create a session row with a caller-supplied sessionId. Returns 400 if the session already exists — safe idempotency without silent upsert.

Request body

  • projectId (UUID, required): Your Sumidata project ID
  • sessionId (UUID, required): Caller-supplied UUID. Re-use this ID on the matching POST /sdk/ingest call so events link correctly
  • deviceId (UUID, required): Your stable device identifier. Use the zero-UUID 00000000-0000-0000-0000-000000000000 if you don't track devices historically
  • startedAt (number | ISO, optional): When the session actually happened — Unix ms or ISO-8601. Defaults to now if omitted (almost always wrong for backfill)
  • externalUserId (string, optional): Stable user identifier. Lets the session tie to user-level attribution immediately
  • utmSource, utmMedium, utmCampaign, utmTerm, utmContent (string, optional): UTM snapshot from the original landing URL. Any set value creates a device_attribution row
  • referrer, landingPage (string, optional): Original referrer / landing URL
  • userAgent, screenSize, platform, browser (string, optional): Optional device context — populate if you logged it at the time

Example request

import-session.ts
import axios from 'axios'
import { randomUUID } from 'crypto'

const sessionId = randomUUID()

await axios.post('https://api.sumidata.io/sdk/sessions/import', {
  projectId: process.env.SUMIDATA_PROJECT_ID,
  sessionId,
  deviceId: order.deviceId ?? '00000000-0000-0000-0000-000000000000',
  externalUserId: order.userId,
  startedAt: order.sessionStartedAt.getTime(),
  utmSource: order.utmSource,
  utmMedium: order.utmMedium,
  utmCampaign: order.utmCampaign,
  referrer: order.referrer,
  landingPage: order.landingPage,
})

Success · 200 OK

response.json
{ "sessionId": "…", "status": "imported" }

Errors

  • 400 · sessionId required / invalid UUID: Missing or malformed
  • 400 · Session <id> already exists: Dedup — the sessionId is in use. Generate a fresh UUID or skip this row
  • 400 · startedAt invalid: Not a number or ISO string, or failed Date parsing
Note: If you don't have historical UTMs at all, skip this step — send events straight through /sdk/ingest with source: 'import' and no sessionId. Conversions land without attribution, and isPrimarySale flags still work — you just won't get UTM splits for the backfilled period.
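For that events-only path, the envelope might look like this (a sketch with illustrative IDs; no sessionId is supplied, which the relaxed source: 'import' validation permits):

```typescript
// Sketch: events-only import with no historical session. The user and
// order IDs below are illustrative values, not real identifiers.
const envelope = {
  projectId: process.env.SUMIDATA_PROJECT_ID,
  deviceId: '00000000-0000-0000-0000-000000000000',
  externalUserId: 'user-123', // illustrative
  source: 'import',
  events: [{
    name: 'purchase',
    orderId: 'order-456', // illustrative
    totalAmount: 49.0,
    timestamp: Date.parse('2024-06-01T12:00:00Z'), // original event time, ms
  }],
}
```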

04 · Importing historical orders

Replay orders in chronological order, carrying the session ID if you imported one.

Expose an admin-only endpoint that iterates your order table and posts each row to /sdk/ingest with source: 'import'. Process orders chronologically — this is essential for correct isPrimarySale flags (see section 05).

admin/sync.controller.ts
import { Get, Query, UnauthorizedException } from '@nestjs/common'

@Get('/admin/sumidata/sync')
async sync(@Query() q: { pwd: string, startDate: string, endDate: string }) {
  if (q.pwd !== process.env.ADMIN_PASSWORD) throw new UnauthorizedException()

  const from = new Date(q.startDate + 'T00:00:00Z').getTime()
  const to = new Date(q.endDate + 'T23:59:59Z').getTime()

  return this.syncService.run({ from, to })
}
Warning: Backfill triggers write traffic — treat the trigger endpoint like any other privileged admin action. Password-protect it, rate-limit by IP, and log every invocation with the caller and the date range. Re-running the same range is safe because of dedup, but an accidental full-history replay from a public endpoint is still an easy way to burn money and minutes.

05 · Keeping isPrimarySale honest

Track which users got a 'first purchase' flag during this run — don't compute it against the live DB row-by-row.

If you evaluate "is this the user's first purchase?" against the live DB for every replayed order, every order in the range looks like a repeat (because all orders are already in the DB). Fix this with two sets:

  • Prior purchasers — users who had a purchase before the replay range starts. Query once at the beginning.
  • Purchasers in this run — users who've had their first order flagged during the current replay. Track in-memory as you iterate chronologically.

A user is a primary sale if they're in neither set when their earliest in-range order is processed.

sync.service.ts
async run({ from, to }: { from: number, to: number }) {
  const priorPurchasers = await orders.distinctUserIds({ paidBefore: from })
  const purchasersInSync = new Set<string>()
  const errors: any[] = []
  let sent = 0

  const rows = await orders.find({ paidAt: { between: [from, to] }, orderBy: 'paidAt ASC' })

  for (const order of rows) {
    const isPrimarySale =
      !priorPurchasers.has(order.userId) && !purchasersInSync.has(order.userId)

    if (isPrimarySale) purchasersInSync.add(order.userId)

    try {
      await ingest({
        projectId: process.env.SUMIDATA_PROJECT_ID,
        deviceId: '00000000-0000-0000-0000-000000000000',
        externalUserId: order.userId,
        source: 'import',
        events: [{
          name: 'purchase',
          orderId: order.id,
          totalAmount: order.total,
          timestamp: order.paidAt.getTime(),
          isPrimarySale,
        }]
      })
      sent++
    } catch (e) {
      errors.push({ orderId: order.id, error: String(e) })
    }
  }

  return { totalProcessed: rows.length, sent, errors: errors.slice(0, 20) }
}

06 · Pre-flight checklist

Five checks before you pull the trigger on a large replay.

1 · Dry-run against a staging project first

Create a separate Sumidata project, point your sync endpoint at it via SUMIDATA_PROJECT_ID, and replay a week's worth of orders. Verify revenue totals, isPrimarySale counts, and UTM breakdowns in the dashboard against your source system. Only then flip to production.

2 · Pre-fetch enrichment data

For large ranges, avoid N+1 lookups. Collect the set of unique user IDs, product IDs, and coupon codes in the range, load them with one query each, then join in-memory as you iterate.

prefetch.ts
const userIds = [...new Set(rows.map((r) => r.userId))]
const productIds = [...new Set(rows.map((r) => r.productId))]

const [usersMap, productsMap] = await Promise.all([
  users.findMapByIds(userIds),
  products.findMapByIds(productIds),
])
3 · Keep batches small, add a short pause

Send 1 event per envelope while iterating chronologically, or batch no more than ~50 events per envelope if you're OK with a batch-level retry on dedup collisions (a single duplicate aborts the whole batch). Insert a short pause every few hundred requests — a few hundred req/sec per project is the comfortable ceiling.
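A pacing loop matching those numbers might look like the sketch below. The 500-request interval and 75 ms pause are the suggestions above, not server-enforced limits; tune them to your project.

```typescript
// Sketch: send items one at a time, pausing briefly every ~500 requests
// to stay under a comfortable per-project throughput ceiling.
const sleep = (ms: number) => new Promise<void>((res) => setTimeout(res, ms))

async function paced<T>(items: T[], send: (item: T) => Promise<void>): Promise<void> {
  let i = 0
  for (const item of items) {
    await send(item)
    if (++i % 500 === 0) await sleep(75) // short pause in the 50–100 ms range
  }
}
```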

4 · Handle errors per-event, not per-batch

Wrap each ingest in try/catch. On 4xx, log and move on (dedup, validation). On 5xx or network failures, requeue with backoff. Keep the first ~20 error messages in the response body for quick debugging, and a dead-letter queue for the stragglers.
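That retry policy can be sketched as a small wrapper. isRetryable is your own predicate (for example, status >= 500 or no response at all); the dead-letter step is left to the caller:

```typescript
// Sketch: retry on transient failures with exponential backoff, give up
// (and let the caller dead-letter) after maxRetries. Non-retryable errors
// (4xx: dedup, validation) are rethrown immediately so you can log and skip.
async function withBackoff(
  send: () => Promise<void>,
  isRetryable: (e: unknown) => boolean,
  maxRetries = 5,
): Promise<void> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await send()
    } catch (e) {
      if (!isRetryable(e) || attempt >= maxRetries) throw e
      // 100 ms, 200 ms, 400 ms, ... between attempts
      await new Promise((r) => setTimeout(r, 2 ** attempt * 100))
    }
  }
}
```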

5 · Verify after the run

Hit GET /api/partner/health to confirm the events landed. Then run a sample query: count events by source for the backfilled period — it should match your source system exactly. Revenue totals by day are a fast second check.

verify.sql
SELECT toDate(timestamp) AS day,
       count() AS conversions,
       sum(totalAmount) AS revenue
FROM events
WHERE source = 'import' AND orderId != ''
  AND timestamp BETWEEN '2025-01-01' AND '2025-12-31'
GROUP BY day ORDER BY day

07 · Updating a backfilled order

Re-sending a conversion with a changed field is rejected by dedup. Here is what to do instead.

The dedup key (orderId, productId) is persistent for a year. You cannot overwrite a stored conversion by re-sending the same orderId with different fields — the retry returns 400 Duplicate conversion: ….

Two options, depending on why you need the update:

  • You added a new field to your schema (e.g. you started tracking couponCode): for future orders, just start sending it. For history, the old events stay without the field — this is normal and expected, and reports handle missing values gracefully.
  • A field was wrong (e.g. totals miscalculated in a historical export): contact support to purge the affected orderIds from events and dedup_keys, then re-run backfill for that range. The project-wide TTL prevents you from doing this in self-serve today.

08 · AI-agent quick reference

Hand this checklist to an agent running an autonomous import.

backfill.yaml
goal: Replay historical orders into Sumidata, preserving timestamps and first-purchase flags.

endpoints:
  sessions:
    path: POST /sdk/sessions/import
    use_when: you have historical UTMs to preserve
    idempotency: caller-supplied sessionId; 400 if it exists
  events:
    path: POST /sdk/ingest
    body: { projectId, deviceId, externalUserId, source: "import", events: [...] }
    idempotency: (orderId, productId); 400 on duplicate aborts whole batch

per_event_required: [name, timestamp (ms)]
per_conversion_required: [orderId, totalAmount]
per_conversion_recommended: [currency, isPrimarySale, productId]

ordering: process orders chronologically — required for correct isPrimarySale
primary_sale_rule: |
  isPrimarySale = user NOT in priorPurchasers AND NOT in purchasersInCurrentRun
  priorPurchasers   = distinct userIds with a paid order before range.start
  purchasersInCurrentRun = in-memory set, add userId when flagging primary

pacing:
  batch_size: 1–50 events per envelope (single-event envelope is the safest default)
  throughput: few hundred requests/sec per project; pause 50–100ms every ~500 requests
  retry: 4xx = log and skip; 5xx / network = exponential backoff; dead-letter after N retries

verify:
  - GET /api/partner/health — last-event timestamp updated
  - SQL: COUNT(*) and SUM(totalAmount) by day WHERE source='import' matches source DB
  - spot-check isPrimarySale by user — only one TRUE per user across the replay