Replay past orders — and, optionally, past sessions with their UTMs — so reports start with full revenue context from day one. Events keep their original timestamps; first-time-buyer flags stay consistent across the replay.
01 How backfill works
Two endpoints cover the whole import — one for sessions, one for events. Both are idempotent.
Sumidata's ingest API is the same surface the browser SDK uses — there is no separate bulk-import pipeline. Backfill means replaying historical rows through two endpoints with caller-supplied UUIDs:
POST /sdk/sessions/import — optional. Reconstructs a historical session with its UTM snapshot, so conversions that reference it inherit the right attribution.
POST /sdk/ingest with source: 'import' — the same endpoint the SDK uses, with relaxed validation that makes sessionId and replayId optional.
Note
Dedup behaviour is load-bearing for safe replays. Ingest checks dedup_keys before every conversion — a repeat of (orderId, productId) returns 400 with Duplicate conversion: … and rejects the whole batch. This makes retries safe, but it also means you cannot update an already-ingested conversion by re-sending it. See Updating a backfilled order.
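Because of this behaviour, a resumable replay should treat a dedup 400 as "already ingested" rather than as a failure. A minimal sketch, assuming single-event envelopes; `IngestResult` and `classifyReplayResult` are illustrative names, not part of the API:

```typescript
// Classify an ingest response for replay purposes: a duplicate (orderId,
// productId) means the event landed on a previous run and can be skipped.
type IngestResult = { status: number; body: string }

function classifyReplayResult(res: IngestResult): 'ingested' | 'already-present' | 'error' {
  if (res.status >= 200 && res.status < 300) return 'ingested'
  // Dedup collision: safe to treat as success when resuming a crashed run.
  if (res.status === 400 && res.body.includes('Duplicate conversion')) return 'already-present'
  return 'error' // other 4xx/5xx: handle per the error-handling checklist below
}
```

With this in place, re-running the same date range after a crash is a no-op for rows that already landed.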
02 Timestamps
Always send timestamp — never let ingest-time stand in for event-time.
When you replay historical orders, the time the request reaches Sumidata is not the time the sale happened. If you omit timestamp, a year of orders collapses into today and every funnel, cohort, and retention chart breaks.
Send timestamp as Unix milliseconds matching when the business event actually occurred — orders.paid_at, invoice.settled_at, whatever your system of record stores. No min/max is enforced by the server; timestamps decades in the past or minutes in the future are both accepted, but reports are capped by project retention (10 years for events, 1 year for dedup_keys).
timestamp.ts
const event = {
  name: 'purchase',
  orderId: order.id,
  totalAmount: order.total,
  timestamp: order.paidAt.getTime(), // original event time, in ms
}
Warning
dedup_keys has a 1-year TTL. A conversion whose createdAt falls outside that window will not collide with a replay of the same (orderId, productId). For the common case of a one-shot first-integration backfill this doesn't matter; for periodic reconciliation runs, do not rely on dedup to block duplicates older than a year.
03 Importing historical sessions
Reconstruct the session the order happened in, so UTMs and landing-page attribution attach correctly.
Conversions inherit UTM attribution by joining sessionId or falling back to device/user attribution tables. If you backfill only events with no prior session, the conversions land without attribution — the AI Analyst can still count revenue, but campaign reports will be empty for the backfilled period.
To keep attribution honest, import the sessions first. Each call creates the session row and a corresponding device_attribution entry if any UTM parameter is present.
400 — sessionId already exists: Dedup — the sessionId is in use. Generate a fresh UUID or skip this row.
400 — startedAt invalid: Not a number or ISO string, or failed Date parsing.
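A sketch of building one session-import payload. Only sessionId and startedAt come from this page; the UTM field names and the order-row shape are assumptions for illustration:

```typescript
import { randomUUID } from 'node:crypto'

// Hypothetical historical row holding the session start and its UTM snapshot.
type HistoricalSession = { sessionStartedAt: Date; utmSource?: string; utmCampaign?: string }

function buildSessionImport(row: HistoricalSession) {
  return {
    sessionId: randomUUID(),                     // caller-supplied; 400 if it already exists
    startedAt: row.sessionStartedAt.getTime(),   // ms (an ISO string also parses)
    utm: { source: row.utmSource, campaign: row.utmCampaign },
  }
}

// POST the payload to /sdk/sessions/import before replaying the order's events,
// and reuse payload.sessionId on those events so attribution joins correctly.
```

Keep the generated sessionId around for the event replay; that join is what makes the conversion inherit the session's UTMs.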
Note
If you don't have historical UTMs at all, skip this step — send events straight through /sdk/ingest with source: 'import' and no sessionId. Conversions fall back to no attribution and isPrimarySale flags still work — you just won't get UTM splits for the backfilled period.
04 Importing historical orders
Replay orders in chronological order, carrying the session ID if you imported one.
Expose an admin-only endpoint that iterates your order table and posts each row to /sdk/ingest with source: 'import'. Process orders chronologically — this is essential for correct isPrimarySale flags (see section 05).
admin/sync.controller.ts
@Get('/admin/sumidata/sync')
async sync(@Query() q: { pwd: string; startDate: string; endDate: string }) {
  if (q.pwd !== process.env.ADMIN_PASSWORD) throw new UnauthorizedException()
  const from = new Date(q.startDate + 'T00:00:00Z').getTime()
  const to = new Date(q.endDate + 'T23:59:59Z').getTime()
  return this.syncService.run({ from, to })
}
Warning
Backfill triggers write traffic — treat the trigger endpoint like any other privileged admin action. Password-protect it, rate-limit by IP, and log every invocation with the caller and the date range. Re-running the same range is safe because of dedup, but an accidental full-history replay from a public endpoint is still an easy way to burn money and minutes.
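Behind the trigger endpoint, the sync service itself is an iterate-and-post loop. A sketch of the two pure pieces, assuming a hypothetical Order row shape; only /sdk/ingest, source: 'import', and the event fields come from this page:

```typescript
// Hypothetical order row from your own database.
type Order = { id: string; userId: string; total: number; paidAt: Date; sessionId?: string }

// Chronological order is required for correct isPrimarySale (section 05).
function sortChronologically(orders: Order[]): Order[] {
  return [...orders].sort((a, b) => a.paidAt.getTime() - b.paidAt.getTime())
}

// One order becomes one ingest envelope, preserving the original event time.
function toEnvelope(order: Order) {
  return {
    source: 'import' as const,
    events: [{
      name: 'purchase',
      orderId: order.id,
      totalAmount: order.total,
      timestamp: order.paidAt.getTime(), // event time, never ingest time
      sessionId: order.sessionId,        // present only if you imported the session
    }],
  }
}

// The run loop is then roughly:
//   for (const order of sortChronologically(orders)) await postIngest(toEnvelope(order))
// where postIngest is your wrapper around POST /sdk/ingest.
```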
05 Keeping isPrimarySale honest
Track which users got a 'first purchase' flag during this run — don't compute it against the live DB row-by-row.
If you evaluate "is this the user's first purchase?" against the live DB for every replayed order, every order in the range looks like a repeat (because all orders are already in the DB). Fix this with two sets:
Prior purchasers — users who had a purchase before the replay range starts. Query once at the beginning.
Purchasers in this run — users who've had their first order flagged during the current replay. Track in-memory as you iterate chronologically.
A user is a primary sale if they're in neither set when their earliest in-range order is processed.
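The two-set rule above can be sketched as a small stateful flagger; the function name is illustrative:

```typescript
// priorPurchasers: queried once before the run (users with a paid order
// before range.start). The closure tracks first purchases seen this run.
function makePrimarySaleFlagger(priorPurchasers: Set<string>) {
  const flaggedThisRun = new Set<string>()
  return (userId: string): boolean => {
    if (priorPurchasers.has(userId) || flaggedThisRun.has(userId)) return false
    flaggedThisRun.add(userId) // earliest in-range order wins the flag
    return true
  }
}
```

Call it once per order while iterating chronologically; because the earliest order is processed first, exactly one order per new user gets isPrimarySale: true.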
06 Pre-flight checks
Five checks before you pull the trigger on a large replay.
1
Dry-run against a staging project first
Create a separate Sumidata project, point your sync endpoint at it via SUMIDATA_PROJECT_ID, and replay a week's worth of orders. Verify revenue totals, isPrimarySale counts, and UTM breakdowns in the dashboard against your source system. Only then flip to production.
2
Pre-fetch enrichment data
For large ranges, avoid N+1 lookups. Collect the set of unique user IDs, product IDs, and coupon codes in the range, load them with one query each, then join in-memory as you iterate.
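A sketch of the collect-once pattern; the loader functions are hypothetical stand-ins for your own queries:

```typescript
// Collect the distinct IDs once, so each entity type needs a single query.
function uniqueIds<T>(rows: T[], key: (row: T) => string): string[] {
  return [...new Set(rows.map(key))]
}

// Usage, assuming hypothetical loaders like loadUsers(ids) / loadProducts(ids):
//   const users = new Map((await loadUsers(uniqueIds(orders, o => o.userId))).map(u => [u.id, u]))
// then join in-memory via users.get(order.userId) while iterating.
```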
3
Batch and pace requests
Send 1 event per envelope while iterating chronologically, or batch no more than ~50 events per envelope if you're OK with a batch-level retry on dedup collisions (a single duplicate aborts the whole batch). Insert a short pause every few hundred requests — a few hundred req/sec per project is the comfortable ceiling.
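A minimal sketch of the batching and pacing pieces; the helper names and the 75 ms pause are illustrative choices within the limits stated above:

```typescript
// Split events into envelopes of at most `size` (keep size ≤ ~50).
function chunk<T>(items: T[], size: number): T[][] {
  const out: T[][] = []
  for (let i = 0; i < items.length; i += size) out.push(items.slice(i, i + size))
  return out
}

const sleep = (ms: number) => new Promise<void>(resolve => setTimeout(resolve, ms))

// In the replay loop, pause periodically to stay under the throughput ceiling:
//   if (requestCount % 500 === 0) await sleep(75)
```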
4
Handle errors per-event, not per-batch
Wrap each ingest in try/catch. On 4xx, log and move on (dedup, validation). On 5xx or network failures, requeue with backoff. Keep the first ~20 error messages in the response body for quick debugging, and a dead-letter queue for the stragglers.
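The retry decision and backoff schedule can be sketched as two small functions; the base delay and cap are illustrative defaults, not documented limits:

```typescript
// 4xx (dedup collision, validation) is permanent: log and move on.
// 5xx or a network failure (status null) is transient: retry with backoff.
function shouldRetry(status: number | null): boolean {
  if (status === null) return true // network failure
  return status >= 500
}

// Exponential backoff, capped: 500ms, 1s, 2s, ... up to 30s.
function retryDelayMs(attempt: number, base = 500, cap = 30_000): number {
  return Math.min(cap, base * 2 ** attempt)
}
```

After N failed retries, park the envelope in your dead-letter queue instead of looping forever.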
5
Verify after the run
Hit GET /api/partner/health to confirm the events landed. Then run a sample query: count events by source for the backfilled period — it should match your source system exactly. Revenue totals by day are a fast second check.
verify.sql
SELECT toDate(timestamp) AS day,
       count() AS conversions,
       sum(totalAmount) AS revenue
FROM events
WHERE source = 'import'
  AND orderId != ''
  AND timestamp BETWEEN '2025-01-01' AND '2025-12-31'
GROUP BY day
ORDER BY day
07 Updating a backfilled order
Re-sending a conversion with a changed field is rejected by dedup. Here is what to do instead.
The dedup key (orderId, productId) is persistent for a year. You cannot overwrite a stored conversion by re-sending the same orderId with different fields — the retry returns 400 Duplicate conversion: ….
Two options, depending on why you need the update:
You added a new field to your schema (e.g. you started tracking couponCode): for future orders, just start sending it. For history, the old events stay without the field — this is normal and expected, and reports handle missing values gracefully.
A field was wrong (e.g. totals miscalculated in a historical export): contact support to purge the affected orderIds from events and dedup_keys, then re-run backfill for that range. The project-wide TTL prevents you from doing this in self-serve today.
08 AI-agent quick reference
Hand this checklist to an agent running an autonomous import.
backfill.yaml
goal: Replay historical orders into Sumidata, preserving timestamps and first-purchase flags.
endpoints:
  sessions:
    path: POST /sdk/sessions/import
    use_when: you have historical UTMs to preserve
    idempotency: caller-supplied sessionId; 400 if it exists
  events:
    path: POST /sdk/ingest
    body: { projectId, deviceId, externalUserId, source: "import", events: [...] }
    idempotency: (orderId, productId); 400 on duplicate aborts whole batch
    per_event_required: [name, timestamp (ms)]
    per_conversion_required: [orderId, totalAmount]
    per_conversion_recommended: [currency, isPrimarySale, productId]
ordering: process orders chronologically — required for correct isPrimarySale
primary_sale_rule: |
  isPrimarySale = user NOT in priorPurchasers AND NOT in purchasersInCurrentRun
  priorPurchasers = distinct userIds with a paid order before range.start
  purchasersInCurrentRun = in-memory set, add userId when flagging primary
pacing:
  batch_size: 1–50 events per envelope (single-event envelope is the safest default)
  throughput: few hundred requests/sec per project; pause 50–100ms every ~500 requests
  retry: 4xx = log and skip; 5xx / network = exponential backoff; dead-letter after N retries
verify:
- GET /api/partner/health — last-event timestamp updated
- SQL: COUNT(*) and SUM(totalAmount) by day WHERE source='import' matches source DB
- spot-check isPrimarySale by user — only one TRUE per user across the replay