ML & AI
Training-Data Provenance Is the Real Alignment Problem
Before we argue about model behavior, we should be able to point to a row of data and explain how it got into the corpus. We mostly cannot.
Most public conversation about AI alignment is downstream of a much simpler problem: we generally cannot point to a row of training data and explain, in concrete terms, how it ended up in the corpus. Not “approximately” — concretely. Which crawl, which heuristic, which dedupe pass, which licensing assumption.
Provenance vs. lineage
It helps to separate two ideas. Lineage is the story of how data flows through your pipeline once you have it: cleaning, tokenization, sharding, mixing. Lineage is mostly a tooling problem, and the field is getting better at it.
Provenance is the story of where the data came from in the first place. That story tends to terminate in phrases like “Common Crawl, snapshot 2024-09” or “a partner dataset” or, worst of all, “we don’t remember exactly”. Provenance is upstream of lineage, and it’s where the genuinely hard ML/AI questions live.
A small, fixable problem
The smallest version of this problem is reproducibility. If a behavior surfaces in eval at version v37 of a model and is gone in v38, the obvious question is: what changed in the data? Today, for many teams, the honest answer is “we changed eight things and re-ran”. This is the same shape of failure as an unrecorded deadline in a distributed system: without the marker written down, post-hoc reconstruction is the only debugging surface you have. A provenance graph would let you answer the question in one query.
# What we want to be able to say, given a provenance set per corpus version:
sources_added = corpus_v38.provenance_set() - corpus_v37.provenance_set()
sources_dropped = corpus_v37.provenance_set() - corpus_v38.provenance_set()
print(sources_added)   # which upstreams entered the mix
print(sources_dropped) # which were filtered out
That doesn’t require a new framework. It requires that every source — every URL list, every partner dump, every synthetic generator — emits a stable identifier, and that the corpus retains that identifier per row.
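A stable identifier can be as simple as a hash over the facts that define a source. The sketch below assumes a three-field schema (name, snapshot, legal basis) purely for illustration; the point is determinism, not the particular fields:

```python
import hashlib

def source_id(name: str, snapshot: str, legal_basis: str) -> str:
    """Derive a stable identifier from the facts that define a source.
    The field set here is illustrative, not a standard schema: the same
    inputs always yield the same identifier, so it can travel with rows."""
    payload = f"{name}\n{snapshot}\n{legal_basis}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]

# Every row in the corpus retains the identifier of the source it came from.
row = {
    "text": "...",
    "source_id": source_id("common-crawl", "2024-09", "public-web"),
}
```

Because the identifier is a pure function of the source’s defining facts, re-running the pipeline reproduces the same identifiers, and the per-version provenance diff above becomes plain set arithmetic.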
The bigger reason it matters
The bigger reason is that regulation is not going to wait for us to figure this out. The EU AI Act, the various US state-level proposals, the inevitable consent-based opt-outs — every one of these requires the operator to answer a provenance question on demand. “We trained on a public mix” is not going to be enough. “We trained on these 4,217 sources, here is the manifest, here is when each was last refreshed, here is the legal basis we relied on for each” — that is going to be the bar.
Building toward that bar is mostly boring infrastructure: hashing, manifests, signing, journaling. None of it is research-flavored. All of it pays off the first time someone asks a hard question and you can answer it with a query instead of a Slack thread.
The alignment debate is not going to make sense until it is grounded in concrete artifacts. Provenance is what makes the artifact concrete, and concreteness, as in every other corner of engineering, starts with a name we have agreed to.