The Same Problem, Twenty Times Over
A new vendor lands on the roadmap. You skim the docs. The auth is almost OAuth2 but with a quirk. Pagination is cursor-based, except for one endpoint that’s offset-based, and another that’s a time window. The schema looks superficially like the last vendor’s — same fields, different names, different nesting, different timestamp format, different idea of what counts as an “actor.” You write the extractor. You write the pagination loop. You write the schema mapping. You wire the checkpoints. You debug the rate-limit handling for a week because the vendor’s documented limits are not the real limits. Three weeks gone.
Then the next vendor arrives. And you do all of it again.
We built more than a dozen security tools into our platform before we admitted the obvious: we kept solving the same problem with different code. Different auth shims, different pagination wrappers, different per-vendor schemas, different observability glue. Each integration looked unique up close, but the silhouettes all rhymed. So we stopped treating every vendor as a snowflake and started treating integrations as composition — pick the pattern, plug in the bricks, fill in the vendor-specific blanks.
This article is the playbook we wish we’d had on day one. It’s our opinion, not a standard — your archetype list and your pack list will look different, and that’s fine. The discipline of having a taxonomy matters more than which one you pick. By the end you’ll have three things:
- A mental model — security integrations as archetype + capability packs. The eight buckets we keep landing in. The seven packs we’ve found sufficient.
- A worked example — GitHub Audit Log API, traced end-to-end through every concept, so you can see how the abstractions land in real code.
- An 8-step checklist — what to do on Monday morning when the next vendor shows up. Copy it, paste it, work it.
Nothing revolutionary here. Just the patterns we wish someone had told us before we wrote the same code a dozen times.
The Archetype Mindset
The first claim of this playbook is the one that took us longest to accept: every integration we’ve shipped fell into one of eight buckets, and the buckets — not the vendors — are what we should be designing around.
An integration archetype is a category of vendor data that demands the same shape — schema, semantics, idempotency rules, replay model — regardless of which vendor produced it. The vendor varies only the source mapping; the target is fixed. Once we accepted this, “build a new integration” stopped meaning “design something new” and started meaning “pick one of eight buckets and fill in the mapping.” That single reframe is what makes the rest of the platform tractable.
After more than a dozen integrations we keep landing in the same eight buckets. Your taxonomy might land on six or eleven — the count matters less than the discipline of having one. Here are ours, with the OCSF target each one normalizes to and a few vendors that fall into each:
Eight Archetypes, Many Vendors
Every security integration we’ve shipped fits in one of these rows.
A quick aside for anyone new to the schema column: OCSF is the Open Cybersecurity Schema Framework — an open standard originally proposed by Splunk and AWS and now developed in the open by a broad vendor community. It gives security data a shared vocabulary: every “API call” looks like an API Activity event (class_uid 6003), every “vulnerability” looks like a Vulnerability Finding (class_uid 2002), and so on. We don’t have to invent these contracts; we just have to map to them.
To be honest about coverage, OCSF covers four of these eight archetypes cleanly today: audit-event-logs (API Activity 6003), findings (Vulnerability Finding 2002), alerts-incidents (Detection Finding 2004), and identity-access (Account Change / User Inventory). The other four (inventory-snapshot, ticketing-itsm, posture-scores, and enrichment) either don’t have OCSF classes yet or have classes that only partially fit; OCSF’s Software Inventory Info (5020), for example, covers SBOM-style package lists, not general cloud-asset or container inventory. For those four, we maintain small canonical schemas internally and watch the OCSF roadmap. The mapping gap is real but bounded.
The punchline. Archetypes pin the target schema. OCSF gives the shared vocabulary where it has coverage. Only the vendor mapping differs. Every audit log on Earth — CloudTrail, Azure Activity, GitHub, Okta — ends up in the same Silver table (api_activity_6003). New vendor, zero new dashboards.
This is also why classification is the first step of every integration. Before we touch auth or pagination, we run a four-question test against the source: does each record have an actor, an action, a resource or target, and a timestamp?
If yes, it’s an audit-event-log and the target is API Activity (6003). If instead each record describes a finding — a CVE, a CVSS score, an affected device — it’s the findings archetype (2002). If records describe current state of a resource rather than an event, it’s inventory-snapshot. The classification is rarely ambiguous in practice, and when it is, the ambiguity itself is a signal that the source is doing two things and should be split into two extractors.
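The four-question test can be sketched as a small function. This is a minimal illustration, not a production classifier: the field-name hints and the `needs-review` fallback are assumptions, and real sources would need their own hint lists.

```python
from typing import Any, Iterable, Mapping

# Hypothetical field hints. Real vendors name these differently, so treat the
# tuples as per-source configuration, not a fixed vocabulary.
EVENT_HINTS = {
    "actor": ("actor", "user", "userIdentity"),
    "action": ("action", "eventName", "operationName"),
    "target": ("repo", "org", "team", "resource", "resources"),
    "timestamp": ("@timestamp", "eventTime", "time"),
}
FINDING_HINTS = ("cve", "cve_id", "cvss", "cvss_score")


def has_any(record: Mapping[str, Any], names: Iterable[str]) -> bool:
    """True if at least one of the named fields is present and non-empty."""
    return any(record.get(name) not in (None, "", [], {}) for name in names)


def classify(record: Mapping[str, Any]) -> str:
    """Apply the four-question test, then fall back to the findings check."""
    if all(has_any(record, names) for names in EVENT_HINTS.values()):
        return "audit-event-logs"
    if has_any(record, FINDING_HINTS):
        return "findings"
    return "needs-review"  # e.g. inventory-snapshot, or a source to split


github_event = {
    "actor": "octocat",
    "action": "repo.create",
    "repo": "acme/app",
    "@timestamp": 1700000000000,
}
print(classify(github_event))  # audit-event-logs
```

Anything that lands in `needs-review` goes to a human, which matches the point above: ambiguity is a signal, not something to paper over in code.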
Worked example — GitHub Audit Log. A GitHub audit event has an actor (the user or app that performed the action), an action (repo.create, team.add_member, org.update_member), a repo / org / team as the target resource, and a @timestamp. Four-for-four on the heuristic, so the archetype is audit-event-logs and the OCSF target is API Activity (class_uid 6003). Concretely: GitHub events land in the same Silver table as AWS CloudTrail and Azure Activity Log.
The “show me everything user X did across our stack in the last 24 hours” dashboard already exists — it just starts returning GitHub rows the moment our extractor turns on. The detection scaffolding our security team already wrote over api_activity_6003 (query surface, joins, time windows) fires against GitHub events on day one — only the per-vendor action vocabulary needs to be added, and that’s a small mapping table, not a new pipeline.
That’s the whole leverage of the archetype mindset. We don’t gain it by writing better code per vendor; we gain it by refusing to write a bespoke schema per vendor.
Capability Packs: The Lego Bricks
If archetypes pin the target, capability packs handle the journey to get there. A capability pack is a reusable solution to a cross-cutting concern that every integration has to solve — authenticating to the source, walking its pages, surviving its rate limits, remembering where it left off, tolerating its schema changes, watching it in production, and never leaking the credential it uses. We treat these concerns as pre-built Lego bricks: each pack has a stable interface, a small set of well-tested implementations, and a configuration surface narrow enough to fit on a page.
New integrations don’t reinvent them; they configure them. Each integration is archetype + a combo of packs, and the only thing that genuinely changes between vendors is which packs you pick and how you tune them.
There are seven we’ve found sufficient in our platform. When we thought we needed an eighth, the “new” concern usually turned out to be one of these seven wearing a different hat — though there are real exceptions we haven’t hit yet (data-quality validation, lineage, dead-letter queues for webhook receivers, schema-registry/contract testing all live just outside the seven below and become first-class when you scale).
- Auth. API key, Bearer token, OAuth2 client credentials, IAM role, or workload identity. Prefer short-lived; cache in memory only; never write to disk or logs.
- Pagination. Cursor, page-number, offset, time-window, or Link-header. Each has different idempotency and replay properties — choose with that in mind.
- Rate limiting. Token-bucket throttle on the way out, exponential backoff with jitter on 429, and always respect Retry-After (or the vendor’s reset header) before you guess.
- Checkpoints. A persisted watermark per tenant per integration, with a small overlap window so late-arriving events aren’t silently dropped on the next run.
- Schema drift. Pin the silver-layer schema. Unknown fields route to an unmapped_properties JSON column. We do not ALTER TABLE on every vendor surprise.
- Observability. Per-window event counts, p95 extraction latency, watermark lag, and dedup ratio — emitted with consistent tags so every integration shows up on the same dashboard.
- Secrets hygiene. Credentials come from a managed secrets store, are scoped per tenant, never appear in git or logs, and are re-fetched at the start of each run so rotation is invisible.
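As one illustration of how narrow a pack's surface can be, here is a minimal sketch of the retry-delay half of the rate-limiting pack: honor `Retry-After` when the vendor sends it, otherwise fall back to exponential backoff with full jitter. The function name and defaults are assumptions, and only the numeric-seconds form of `Retry-After` is handled.

```python
import random
from typing import Callable, Mapping


def retry_delay(
    attempt: int,
    headers: Mapping[str, str],
    base: float = 1.0,
    cap: float = 60.0,
    rng: Callable[[], float] = random.random,
) -> float:
    """Seconds to wait before retrying after a 429 response.

    `headers` is looked up case-sensitively here; an HTTP client's
    case-insensitive header mapping can be passed in directly.
    """
    retry_after = headers.get("Retry-After")
    if retry_after is not None:
        try:
            return max(0.0, float(retry_after))
        except ValueError:
            pass  # HTTP-date form is not handled in this sketch
    # Full jitter: a uniform draw between 0 and the capped exponential bound.
    return rng() * min(cap, base * (2 ** attempt))
```

Injecting `rng` keeps the function deterministic under test while production code uses the default.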
A note on delivery mode
The article’s worked example pulls from a REST API on a schedule, which is one delivery mode among four common ones. Delivery mode is orthogonal to archetype — a findings source might arrive as a daily S3 export or as a webhook stream — and it changes which packs dominate.
- Pull-batch (GitHub Audit, AWS Inspector, most SaaS REST APIs) — what most of this article describes. Watermarks and pagination dominate.
- Push / webhook (Okta Event Hooks, GitHub webhooks, AWS EventBridge → HTTPS) — signature verification, idempotent receipt, and a dead-letter queue dominate; watermarks aren’t a thing.
- Streaming (Kafka topics from a SIEM, syslog/CEF over TCP) — consumer-group offsets replace watermarks; backpressure dominates.
- File-drop (vendor exports to S3 or SFTP) — manifest detection, partial-write protection, and schema-on-read dominate.
The seven packs above still apply across all four modes, but their configuration shifts. We mention this because the worked example below is pull-batch — don’t read its specifics as universal.
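To show how the packs shift under push delivery, here is a minimal sketch of signature verification and idempotent receipt for a GitHub webhook. GitHub signs the request body with HMAC-SHA256 and sends the result in the `X-Hub-Signature-256` header, with a delivery ID in `X-GitHub-Delivery`. The `receive` function, its return values, and the in-memory `seen` set are assumptions for illustration; a real receiver would persist delivery IDs and route rejects to a dead-letter queue.

```python
import hashlib
import hmac
from typing import Mapping, Set


def verify_signature(secret: bytes, body: bytes, signature_header: str) -> bool:
    """Constant-time check of a 'sha256=<hexdigest>' signature header."""
    expected = "sha256=" + hmac.new(secret, body, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected.encode(), signature_header.encode())


def receive(secret: bytes, body: bytes, headers: Mapping[str, str],
            seen: Set[str]) -> str:
    """Return 'accepted', 'duplicate', or 'rejected' for one delivery."""
    if not verify_signature(secret, body, headers.get("X-Hub-Signature-256", "")):
        return "rejected"
    delivery_id = headers.get("X-GitHub-Delivery")
    if delivery_id is None:
        return "rejected"
    if delivery_id in seen:
        return "duplicate"  # redelivery: acknowledge without re-processing
    seen.add(delivery_id)
    return "accepted"
```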
GitHub Audit Log pack inventory
Put this against a real source and the abstraction stops being abstract. Here is the pack inventory we’d write for GitHub’s organization audit log (pull-batch over REST):
| Pack | Choice |
|---|---|
| Auth | Bearer token — classic PAT with admin:org scope (or read:audit_log on GitHub Enterprise), or a GitHub App with Organization administration: read |
| Pagination | Cursor via Link: <…>; rel="next" response header |
| Rate limit | Subject to GitHub’s primary (5,000 req/hr for PATs) and secondary rate limits; installation tokens scale higher — see GitHub’s rate-limit docs |
| Checkpoints | Watermark on event timestamp, with overlap tuned to observed delivery latency (see Section 5 rule #3) |
| Schema drift | New audit action names appear constantly → route to unmapped_properties |
| Observability | Log events_received per window; alert on unexpectedly empty windows |
That table is, almost in its entirety, the integration-specific configuration that GitHub demands of our platform. Everything it leaves unsaid inherits platform defaults: standard structured logging, the platform’s secrets store (the one pack with no row above), the platform’s drift router, the platform’s retry policy. None of the rows above is GitHub-specific code; they’re six dials on a pre-built chassis.
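One way to picture "dials on a chassis" is a small declarative spec that overrides platform defaults. This is a sketch under assumptions: `IntegrationSpec`, `PLATFORM_DEFAULTS`, and every value string are hypothetical names chosen for illustration, not an actual platform API.

```python
from dataclasses import dataclass, field
from typing import Dict

# Hypothetical defaults, one entry per capability pack.
PLATFORM_DEFAULTS: Dict[str, str] = {
    "auth": "bearer",
    "pagination": "none",
    "rate_limit": "backoff-with-jitter",
    "checkpoints": "event-time-watermark",
    "schema_drift": "route-to-unmapped_properties",
    "observability": "structured-logging",
    "secrets": "platform-secrets-store",
}


@dataclass(frozen=True)
class IntegrationSpec:
    provider: str        # globally unique metadata_provider value
    archetype: str
    delivery_mode: str
    packs: Dict[str, str] = field(default_factory=dict)

    def resolved_packs(self) -> Dict[str, str]:
        """Overlay this integration's choices on the platform defaults."""
        unknown = set(self.packs) - set(PLATFORM_DEFAULTS)
        if unknown:
            raise ValueError(f"unknown packs: {sorted(unknown)}")
        return {**PLATFORM_DEFAULTS, **self.packs}


GITHUB_AUDIT = IntegrationSpec(
    provider="github",
    archetype="audit-event-logs",
    delivery_mode="pull-batch",
    packs={
        "pagination": "link-header-cursor",
        "observability": "events_received-per-window",
    },
)
```

Rejecting unknown pack names is a deliberate choice in this sketch: a proposed eighth pack should be a design conversation, not a typo that silently passes.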
SDKs first
Before showing the code, one disclaimer: when the vendor provides a well-maintained SDK (boto3 for AWS, octokit for GitHub, okta-sdk-python, slack-sdk), use it. We only drop to raw HTTP when we need multi-tenant token rotation that the SDK doesn’t expose, fine-grained retry control beyond the SDK’s defaults, or consistent observability hooks across vendors. For a typical first integration, an SDK plus the seven packs is enough — the sketches below are illustrative of shape, not a recommendation to hand-roll HTTP for every source.
Auth — Bearer header with a token-provider abstraction
def bearer_headers(token_provider):
    # token_provider is a callable returning a fresh, in-memory token.
    # It hides whether the source is a static PAT, OAuth2 refresh,
    # an IAM-issued token, or a per-tenant secrets lookup.
    token = token_provider()
    return {
        "Authorization": f"Bearer {token}",
        "Accept": "application/json",
    }
The token_provider is the seam. Swap a PAT for a GitHub App installation token, or for an OAuth2 client-credentials flow against another vendor, and nothing else in the extractor changes.
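A token provider that fits this seam might look like the following. It is a minimal sketch: `caching_provider`, the `fetch` callable, and the TTL handling are assumptions for illustration. The token lives only in a closure, in memory, and is re-fetched once the TTL elapses.

```python
import time
from typing import Callable


def caching_provider(
    fetch: Callable[[], str],
    ttl_seconds: float,
    clock: Callable[[], float] = time.monotonic,
) -> Callable[[], str]:
    """Wrap fetch() so a token is reused in memory until ttl_seconds elapse."""
    state = {"token": None, "expires_at": 0.0}

    def provider() -> str:
        now = clock()
        if state["token"] is None or now >= state["expires_at"]:
            state["token"] = fetch()  # e.g. a secrets-store or OAuth2 call
            state["expires_at"] = now + ttl_seconds
        return state["token"]

    return provider
```

The returned callable can be handed to `bearer_headers` unchanged, so swapping a static PAT for a short-lived token only changes what `fetch` does.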
Pagination — Link-header cursor as a generator
import re


def iter_pages(session, url, headers):
    while url:
        resp = session.get(url, headers=headers)
        resp.raise_for_status()
        yield resp.json()
        link = resp.headers.get("Link", "")
        # Illustrative only — naive regex that does not handle the full
        # RFC 8288 Link-header grammar (quoted strings with commas,
        # multiple link-values, alternate parameter ordering, etc.).
        #
        # In production use a real parser:
        # requests.utils.parse_header_links
        # or httpwg structured-headers implementation
        m = re.search(r'<([^>]+)>;\s*rel="next"', link)
        url = m.group(1) if m else None
A generator, not a list — so the caller can stream pages straight to bronze, checkpoint between them, and stop early on shutdown without buffering the whole extraction in memory.
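To show what "checkpoint between them" can mean for the caller, here is a minimal sketch of a loop that consumes any page iterator, writes each page to Bronze, and advances a watermark. The function names and the `write_bronze` / `save_watermark` callables are hypothetical, and the sketch assumes events carry an integer `@timestamp` and that pages arrive oldest-first.

```python
from typing import Callable, Iterable, List


def window_start(watermark_ms: int, overlap_ms: int) -> int:
    """Start the next pull slightly before the saved watermark."""
    return max(0, watermark_ms - overlap_ms)


def run_extraction(
    pages: Iterable[List[dict]],
    watermark_ms: int,
    write_bronze: Callable[[List[dict]], None],
    save_watermark: Callable[[int], None],
) -> int:
    """Stream pages to Bronze, checkpointing after each one.

    Assumes pages arrive oldest-first; with newest-first ordering a crash
    after the first checkpoint would skip the older pages.
    """
    for page in pages:
        if not page:
            continue
        write_bronze(page)  # persist first, then move the watermark
        newest = max(event["@timestamp"] for event in page)
        if newest > watermark_ms:
            watermark_ms = newest
            save_watermark(watermark_ms)
    return watermark_ms
```

The overlap from `window_start` re-reads some events on the next run, so this only stays correct when the downstream write deduplicates on a stable event ID.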
The deeper point: when we onboard the next audit-log vendor — Okta, Azure Activity, CloudTrail, the next one after that — we don’t write new auth code, new pagination code, new retry code, or new secrets plumbing. We pick the packs, fill in the configuration, and spend our effort on the only part that’s genuinely new: the source-to-OCSF mapping. That’s where the leverage lives.
The Pipeline Skeleton: Bronze → Silver → Gold
Once we picked our archetype and our capability packs, we still needed a place to put the data. We adopted the medallion architecture — Bronze, Silver, Gold — and we keep the labels because they’re widely understood, even when “Bronze” and “Silver” are just nicknames for “raw layer” and “conformed layer.” What actually matters is that each layer answers a different question and has a different contract with the next one downstream.
We think of the three layers like this:
| Layer | The question it answers | What lives here | Schema rule |
|---|---|---|---|
| Bronze | Do we have the data? | Raw vendor payload, append-only, partitioned by ingest date. No business logic, no renaming, no type coercion. | Vendor-shaped — whatever the API returned. |
| Silver | Do we trust the data? | Archetype-normalized to OCSF. One canonical table per archetype, shared across every vendor that fits it. | Fixed per archetype. |
| Gold | Can we query the data? | Joined to dimensions (users, assets, accounts), partitioned by event time, shaped for analysts and dashboards. | Business-shaped, vendor-agnostic. |
Bronze — “do we have the data?”
Bronze is the cheapest layer to get right and the most expensive one to get wrong. Its only job is to faithfully record what the vendor sent us, so that if anything downstream is broken we can rebuild from scratch without going back to the API. That means no transformations, no cleverness, no opinions. The raw JSON lands in an Iceberg table, partitioned by the date we ingested it, and stays there. If we later discover we mapped a field incorrectly, we replay from Bronze. We do not re-extract.
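A Bronze row can be as plain as the following sketch. The column names and the `bronze_row` helper are assumptions; the point is that the payload string is stored verbatim and the only derived value is the ingest-date partition.

```python
from datetime import datetime, timezone
from typing import Optional


def bronze_row(raw_payload: str, tenant: str, provider: str,
               ingested_at: Optional[datetime] = None) -> dict:
    """Wrap a raw vendor payload for an append-only Bronze table.

    The payload is not parsed, renamed, or coerced in any way.
    """
    ingested_at = ingested_at or datetime.now(timezone.utc)
    return {
        "tenant": tenant,
        "provider": provider,
        "ingest_date": ingested_at.date().isoformat(),  # partition column
        "ingested_at": ingested_at.isoformat(),
        "raw": raw_payload,
    }
```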
Silver — “do we trust the data?”
Silver is where the archetype earns its keep. Every audit-log vendor on Earth — CloudTrail, Azure Activity, GitHub Audit, Okta System Log — lands in the same Silver table, with the same columns, in OCSF API Activity shape (class_uid 6003). The mapping code is per-vendor, but the target is not. This is the discipline that makes the whole architecture work:
Silver schema is fixed per archetype. Unknown fields go to a JSON column. This is what keeps dashboards vendor-agnostic.
A new audit-log vendor cannot add a column. It can only add a row. Vendor-specific fields that do not map to OCSF — Azure’s RBAC claims, CloudTrail’s request parameters, GitHub’s per-event metadata — land in an unmapped_properties JSON blob. They are not lost; they are just not promoted to first-class columns. Promoting them would force every downstream query to know which vendor it was looking at, which is the exact problem we are trying to avoid.
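The "add a row, never a column" rule can be sketched as a mapping function with a fixed output shape. The flattened column names and the `FIELD_MAP` below are assumptions for illustration, not the official OCSF attribute paths; only `class_uid` 6003 and the `unmapped_properties` / `metadata_provider` names come from the text above.

```python
import json
from typing import Any, Dict, Mapping

# Hypothetical flattened Silver columns for the audit-event-logs archetype.
SILVER_COLUMNS = (
    "class_uid", "time", "actor_user_name", "api_operation",
    "resource_name", "metadata_provider", "unmapped_properties",
)

# Per-vendor mapping: the only part that changes for a new source.
FIELD_MAP = {
    "actor": "actor_user_name",
    "action": "api_operation",
    "repo": "resource_name",
    "@timestamp": "time",
}


def to_silver(event: Mapping[str, Any], provider: str = "github") -> Dict[str, Any]:
    """Map one vendor event to the fixed Silver shape."""
    row: Dict[str, Any] = {"class_uid": 6003, "metadata_provider": provider}
    unmapped: Dict[str, Any] = {}
    for key, value in event.items():
        column = FIELD_MAP.get(key)
        if column:
            row[column] = value
        else:
            unmapped[key] = value  # kept, but not promoted to a column
    for column in SILVER_COLUMNS:
        row.setdefault(column, None)  # missing fields become NULLs
    row["unmapped_properties"] = json.dumps(unmapped, sort_keys=True)
    return row
```

Whatever the vendor sends, the returned dict always has exactly the columns in `SILVER_COLUMNS`, which is the property the dashboards rely on.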
Gold — “can we query the data?”
Gold is where the data meets the business. We join to the dimensions analysts actually ask about — user, asset, account, tenant — and we partition by event time, not ingest time, because compliance questions (“show me everything user X did between Y and Z”) only make sense against event time. Because Silver is vendor-agnostic, Gold is too. The “user activity” mart does not know or care whether a row originated in CloudTrail or GitHub.
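The event-time rule is easy to state in code. The sketch below derives the partition from the event's own timestamp (assumed to be epoch milliseconds in a `time` field) and upserts by a stable `event_id`, so a late or replayed event lands in the day it occurred and never duplicates. The in-memory `table` dict is a stand-in for a real table format.

```python
from datetime import datetime, timezone
from typing import Dict, Iterable


def event_partition(epoch_ms: int) -> str:
    """UTC date of the event itself, not of the moment we ingested it."""
    moment = datetime.fromtimestamp(epoch_ms / 1000, tz=timezone.utc)
    return moment.date().isoformat()


def merge(table: Dict[str, Dict[str, dict]], rows: Iterable[dict]) -> None:
    """Upsert rows into their event-time partition, keyed by event_id."""
    for row in rows:
        partition = table.setdefault(event_partition(row["time"]), {})
        partition[row["event_id"]] = row  # same id overwrites, never appends
```

Running `merge` twice with the same rows leaves the table unchanged, which is the behavior a retry or backfill needs.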
GitHub Audit Log, end to end
See Figure 2 for the GitHub example end to end. The trace is short: the GitHub Audit Log API lands in a Bronze table called github_audit_log, raw payload, ingest-date partitioned. The Silver transform maps it into api_activity_6003, the same table that already holds CloudTrail and Azure Activity events. Gold is the existing “user activity” mart — the dashboard team writes zero new code. The integration is done when the new rows show up in the existing dashboard.
That is the payoff.
Three Layers, One Convergence Point
GitHub Audit Log traced end-to-end. Bronze is per-vendor. Silver is shared.
[Figure: Bronze (one table per vendor, partitioned by ingest_date) → Silver (schema fixed per archetype, partitioned by event_time, merge on stable event id, unmapped_properties → JSON, metadata_provider: github | cloudtrail | azure_activity) → Gold (joined to user / asset / account, analyst-shaped, dashboard-ready)]
The Disciplines That Separate Prototypes from Production
A prototype ingests a vendor’s data once. A production integration ingests it every fifteen minutes for the next five years, survives backfills, replays, vendor outages, audits, and privacy reviews — without quietly corrupting itself or anyone’s data. The gap between those two states is not framework choice or cleverness. It is seven disciplines we now apply to every integration without exception. Skip any of them and you will eventually pay, usually at the worst possible moment.
- Idempotency by design: MERGE on a stable event ID. Re-running the same window must produce the same bytes. We use a deterministic merge key — for GitHub audit events ingested via GitHub’s Splunk integration that’s _document_id, and for the raw REST API a composite of (@timestamp, actor, action, created_at) works as a stable surrogate. We write with a MERGE-style overwrite of the affected partitions. The minute you append instead of merge, every retry, backfill, and replay quietly doubles your row counts. Detection rules that count events become wrong. Compliance reports that count events become wrong. And nobody notices until an auditor does.
- Partition by event time, not processing time. A late event that occurred at 23:58 yesterday must land in yesterday’s partition, not today’s. Compliance and detection queries are phrased in event time: “show me every privileged action between 14:00 and 15:00 on the 12th.” If you partition by ingest time, late arrivals — and they always arrive late, because every vendor’s API is eventually consistent — silently disappear from those queries. You will pass tests for weeks before someone notices the gap.
- Silver schema is fixed; unknown fields go to a JSON column. Vendors add fields whenever they feel like it. If your Silver schema mutates every time that happens, you cannot unify across vendors, your dashboards break on every drift, and every downstream consumer plays whack-a-mole with new columns. We pin the Silver schema to the archetype’s canonical shape and route everything unmapped into a single JSON column. Surprises become inspectable data, not migrations.
- Pick a multi-tenancy isolation model and enforce it mechanically. We chose schema-per-tenant because we wanted isolation by construction — one schema per customer, every query scoped by the schema itself rather than by a WHERE clause we have to remember to write. The trade-offs are real: metadata bloat past a few thousand tenants, fan-out cost on every migration, more catalog objects to monitor. Many serious platforms instead enforce a tenant column at the catalog/policy layer (Snowflake row-access policies, BigQuery authorized views, Databricks Unity Catalog RLS, Postgres RLS) with equal safety and better scaling characteristics. Either model can be production-grade; what we’d refuse to ship is a bare tenant_id filter with the isolation living only in handwritten query conventions. Pick a model, encode it where humans can’t accidentally bypass it, and move on.
- metadata_provider is globally unique per integration. Every event in the shared Silver table carries a provider tag: github, cloudtrail, azure_activity. We treat this string as a namespace and forbid reuse. Two integrations sharing a value silently corrupt the shared table — joins overcount, deduplication merges unrelated events, and the contamination spreads to every Gold mart downstream. Nobody notices for weeks, because the rows still look plausible.
- PII minimization at the Bronze → Silver boundary. Bronze keeps the raw vendor payload for replay; Silver is what analysts query, dashboards visualize, and rules run against. For every PII-bearing field a vendor sends us — user emails, source IPs, request bodies, internal note text, document titles — we make one of four calls at the mapping boundary: drop (the field stays in Bronze only and never enters Silver), hash with a per-tenant salt (preserves join semantics within the tenant, prevents cross-tenant correlation), mask to a redacted form (last-4 of a card, domain-only of an email), or keep with explicit justification. The default is drop, not keep. This is also when we set Bronze retention — typically much shorter than Silver’s regulatory tail — because the minimization in Silver is what makes long retention defensible. Skip this discipline and your blog post about dashboards becomes a blog post about how your dashboards leaked customer IPs.
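The four PII calls in the last discipline can be written down as data and applied mechanically. This is a minimal sketch: the field names and the `DECISIONS` table are hypothetical, the input is assumed to contain only the PII-bearing fields under review, and the salted hash is implemented here as HMAC-SHA256 keyed by the tenant salt, which is one option among several.

```python
import hashlib
import hmac
from typing import Any, Dict, Mapping


def hash_with_salt(value: str, tenant_salt: bytes) -> str:
    """Keyed hash: stable within a tenant, uncorrelated across tenants."""
    return hmac.new(tenant_salt, value.encode("utf-8"), hashlib.sha256).hexdigest()


def mask_email(email: str) -> str:
    """Keep only the domain of an email address."""
    _, sep, domain = email.partition("@")
    return f"***@{domain}" if sep and domain else "***"


# Hypothetical per-field decisions, written down before any code review.
DECISIONS = {
    "actor_email": "mask",
    "actor_ip": "hash",
    "request_body": "drop",
    "action": "keep",
}


def minimize(record: Mapping[str, Any], decisions: Mapping[str, str],
             tenant_salt: bytes) -> Dict[str, Any]:
    """Apply drop / hash / mask / keep at the Bronze-to-Silver boundary."""
    out: Dict[str, Any] = {}
    for name, value in record.items():
        decision = decisions.get(name, "drop")  # the default is drop, not keep
        if decision == "keep":
            out[name] = value
        elif decision == "hash":
            out[name] = hash_with_salt(str(value), tenant_salt)
        elif decision == "mask":
            out[name] = mask_email(str(value))
    return out
```

Because an undecided field falls through to `drop`, a vendor adding a new PII-bearing field cannot leak it into Silver by accident.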
A Build Checklist for Monday
The Monday-morning vendor playbook: any source, any API, in eight moves.
So what does this look like when the next vendor lands on your desk? Here’s the same playbook, condensed. A Monday-morning starting point you can apply to any vendor regardless of API shape, auth model, or data volume.
Read it top to bottom once. After that it becomes muscle memory, and every new connector starts looking like the last one.
1. Classify. Decide what kind of source you’re actually looking at. Run the four-question test: actor + action + target + timestamp. Use this matrix to pick the archetype before you write any code. If two answers fit, the source is doing two things and should be split into two extractors.
2. Deliver. The archetype tells you the target shape; the delivery mode tells you which packs dominate. Pick exactly one. It changes which capability packs you need more than anything else.
3. Inventory. List the packs this vendor will need before you write a line. Most of the cost in any connector lives in these capabilities, not in the business logic. Pagination becomes signature verification on webhooks, or offset management on streams. The slot is the same, the contents change with delivery mode.
4. Map. Decide what fields survive, what gets dropped, what gets masked. OCSF target ⇔ vendor source, with explicit decisions at every boundary. Three decisions, made upfront in a document, not in code review:
   - Which vendor fields map cleanly to OCSF columns.
   - Which fields stay unmapped and route to a JSON column for forensic preservation.
   - Which PII-bearing fields to drop, hash, mask, or keep at the Bronze → Silver boundary.
Steps 5 to 7: three modules, three jobs. No logic crosses lines.
5. Extract. A single module with no business logic. Authenticate, paginate (or subscribe, or list-and-download), and yield raw records downstream untouched. That’s the whole job. Prefer a vendor SDK if one is well maintained. Drop to raw HTTP only when you need multi-tenant token rotation or fine-grained retry control. Those are the two places SDKs almost always disappoint.
6. Land (Bronze). Raw payloads land in Iceberg, append-only. Partition by ingest date. Never mutate. Never apply business logic at this layer. Bronze is the replay tape. If Silver is wrong tomorrow, you re-derive it from here without re-pulling from the vendor.
7. Normalize (Silver). Bronze rows normalize into archetype-shaped OCSF. Unknown fields serialize to a JSON column rather than being dropped. The Silver schema is fixed per archetype, not per vendor. So a dashboard built on audit-event-log keeps working when you onboard the next IdP, the next EDR, the next CASB.
8. Gate. Nothing ships until every gate is green. Stand up pipeline status, freshness metrics, alerting, and an empty-window alarm. Then run the verification pass against the archetype’s quality gate. The run must reproduce end-to-end before the connector counts as shipped. The gates:
   - Schema conformance
   - Partitioning
   - Idempotency
   - Multi-tenant isolation
   - PII minimization
   - End-to-end reproduction
The playbook compresses to a single sentence: classify, deliver, inventory, map, extract, land, normalize, gate.
Every vendor looks bespoke on the first read. By step three, they all look the same. That’s the point.
Why This Pays Off
The first integration is expensive. The second is cheaper. The twentieth is almost free. That curve is the whole point.
When we started, every vendor was a from-scratch project. Now most of the work is done before we open the spec. The archetype pins the target schema. The capability packs cover the cross-cutting concerns. The Bronze/Silver/Gold skeleton and the dashboards already exist. What’s left is the vendor-specific mapping — fill-in-the-blank, not a research project. The platform gets cheaper to extend over time, not more expensive.
Three concrete ways this compounds:
- Adding the Nth audit-log vendor takes hours, not weeks. Same archetype, same Silver table (OCSF API Activity, 6003), same Gold mart. We write the extraction module and the source-to-OCSF mapping; everything downstream already exists.
- Detection scaffolding compounds across vendors, though detection logic still needs a thin per-vendor layer. OCSF 6003 standardizes the envelope (actor, action, time, target), so the query surface, joins, time windows, and severity scoring of every detection rule are written once. The per-vendor action vocabulary (cloudtrail: ConsoleLogin vs. github: org.update_member vs. azure: Microsoft.Authorization/roleAssignments/write) still needs a mapping table. The expensive 80% is shared; the irreducible 20% stays small and lives in one place.
- Dashboards built for one vendor’s data work for all vendors of that archetype. “User activity over time” doesn’t care whether the row came from Okta, GitHub, or Entra ID. The vendor column filters; the visualization stays the same.
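The per-vendor action vocabulary mentioned above can be as small as a lookup table. In this sketch the normalized labels (`privilege_change`, `interactive_login`, `unclassified`) and the column names are hypothetical; the vendor action strings are the ones quoted in the list.

```python
from typing import Dict, Iterable, List, Tuple

# (metadata_provider, vendor action) -> hypothetical normalized label.
ACTION_VOCAB: Dict[Tuple[str, str], str] = {
    ("github", "org.update_member"): "privilege_change",
    ("azure_activity", "Microsoft.Authorization/roleAssignments/write"): "privilege_change",
    ("cloudtrail", "ConsoleLogin"): "interactive_login",
}


def normalize_action(provider: str, action: str) -> str:
    """Translate a vendor action name into the shared vocabulary."""
    return ACTION_VOCAB.get((provider, action), "unclassified")


def privilege_changes(rows: Iterable[dict]) -> List[dict]:
    """A detection written once against the shared Silver shape."""
    return [
        row for row in rows
        if normalize_action(row["metadata_provider"], row["api_operation"])
        == "privilege_change"
    ]
```

Onboarding another audit-log vendor then means adding rows to `ACTION_VOCAB`; `privilege_changes` itself does not change.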
None of this works without open standards and open infrastructure: OCSF gives us the shared vocabulary, Apache Iceberg the storage format, Apache Spark the compute. We composed them, we didn’t invent them.
A note on portability: the patterns in this article (archetype + capability packs + medallion + the seven production disciplines) are stack-agnostic.
Iceberg + Spark is what we run; the same discipline works just as well on Snowflake (with row-access policies for tenancy), BigQuery + dbt, Databricks + Unity Catalog, Trino + Iceberg, or even DuckDB for a small deployment.
Pick the substrate that matches your team and budget — the architecture is portable.