Merge policies

Merge policies are the per-field configuration the reducer uses to collapse a stream of observations into a single entity snapshot. They are declarative: every policy is a strategy plus an optional tie-breaker, with no inline code. This is what makes snapshot composition deterministic and replayable: the same observations and the same merge_policies always produce the same snapshot.

Merge policies sit inside reducer_config on every schema_registry row and are read once per snapshot recomputation. The reducer never falls back to ad-hoc logic: fields without an explicit policy use the documented default, last_write.

Schema

ReducerConfig and MergePolicy (TypeScript)

interface ReducerConfig {
  merge_policies: Record<string, MergePolicy>;
}

interface MergePolicy {
  strategy:
    | "last_write"
    | "highest_priority"
    | "most_specific"
    | "merge_array";
  tie_breaker?: "observed_at" | "source_priority";
}

// Example: invoice
{
  "merge_policies": {
    "vendor_name": { "strategy": "highest_priority" },
    "amount_due": { "strategy": "last_write" },
    "status": { "strategy": "last_write" },
    "aliases": { "strategy": "merge_array" },
    "line_items": { "strategy": "merge_array" }
  }
}

Four strategies

  • last_write picks the most recent observation by observed_at. It is the right default for fields that change over time (status, amount, address).
  • highest_priority picks the observation with the highest source_priority. It is the right choice for identity-shaped fields, where a user correction (1000) should always beat a structured agent write (100) or an AI extraction (0).
  • most_specific picks the observation with the highest specificity_score. It is useful when one source produces dense, schema-aligned facts and another produces shallow ones.
  • merge_array unions array values across observations. It is used for aliases, tags, and other accumulating sets.
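The four strategies can be sketched as a single dispatch function. This is an illustration rather than the reducer's actual code: the Observation shape, the ISO 8601 timestamp format, and the helper names maxBy and applyStrategy are assumptions, and ties are deliberately left to the first observation seen because tie-breaking is a separate concern.

```typescript
// Hypothetical observation shape. Only the names observed_at, source_priority
// and specificity_score come from the text; the rest is assumed.
interface Observation {
  value: unknown;
  observed_at: string; // assumed to be an ISO 8601 timestamp
  source_priority: number;
  specificity_score: number;
}

type Strategy = "last_write" | "highest_priority" | "most_specific" | "merge_array";

// Return the observation with the largest key. On equal keys the earlier
// input element is kept; real tie-breaking is out of scope for this sketch.
function maxBy(obs: Observation[], key: (o: Observation) => number): Observation {
  return obs.reduce((best, o) => (key(o) > key(best) ? o : best));
}

function applyStrategy(strategy: Strategy, obs: Observation[]): unknown {
  if (obs.length === 0) return undefined;
  switch (strategy) {
    case "last_write":
      return maxBy(obs, (o) => Date.parse(o.observed_at)).value;
    case "highest_priority":
      return maxBy(obs, (o) => o.source_priority).value;
    case "most_specific":
      return maxBy(obs, (o) => o.specificity_score).value;
    case "merge_array": {
      // Union of array values, de-duplicated by JSON encoding, in first-seen order.
      const seen = new Map<string, unknown>();
      for (const o of obs) {
        const items: unknown[] = Array.isArray(o.value) ? o.value : [o.value];
        for (const item of items) {
          const k = JSON.stringify(item);
          if (!seen.has(k)) seen.set(k, item);
        }
      }
      return Array.from(seen.values());
    }
  }
}
```

De-duplicating by JSON encoding is one possible reading of "union"; a real reducer may compare elements differently, for example by a line-item id.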

Tie-breakers

When two observations score equally under the chosen strategy, the tie_breaker decides. observed_at favours the more recent observation; source_priority favours the higher-priority writer. The default tie-breaker for last_write and most_specific is observed_at; for highest_priority it is source_priority. Ties are resolved deterministically: the reducer never picks at random.
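The defaults above can be written down as a small lookup plus a comparator. This is a sketch under assumptions: the function names defaultTieBreaker and breakTie and the Scored shape are hypothetical, and because the text states no default for merge_array, the lookup returns undefined for it.

```typescript
type Strategy = "last_write" | "highest_priority" | "most_specific" | "merge_array";
type TieBreaker = "observed_at" | "source_priority";

// Hypothetical minimal shape: just the two fields a tie-breaker reads.
interface Scored {
  observed_at: string; // assumed to be an ISO 8601 timestamp
  source_priority: number;
}

// Defaults as stated in the text.
function defaultTieBreaker(strategy: Strategy): TieBreaker | undefined {
  switch (strategy) {
    case "last_write":
    case "most_specific":
      return "observed_at";
    case "highest_priority":
      return "source_priority";
    case "merge_array":
      return undefined; // no default is documented for merge_array
  }
}

// Negative when a should win, positive when b should win, 0 when still tied.
function breakTie(tieBreaker: TieBreaker, a: Scored, b: Scored): number {
  return tieBreaker === "observed_at"
    ? Date.parse(b.observed_at) - Date.parse(a.observed_at)
    : b.source_priority - a.source_priority;
}
```

A result of 0 means the two observations are still tied on that dimension; how the reducer orders them from there is not covered by the text, so the sketch leaves that case to the caller.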

Default behaviour for unmapped fields

If a field appears in observations but has no entry in merge_policies (for instance, a removed field that still has historical observations), the reducer falls back to last_write. This keeps schema removal from corrupting historic snapshots: removed fields drop out of new snapshots via schema-projection filtering, but until then the policy is well-defined.
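The fallback amounts to a policy lookup with a fixed default. A minimal sketch, assuming hypothetical names policyFor and DEFAULT_POLICY; the two interfaces repeat the schema so the block stands on its own.

```typescript
interface MergePolicy {
  strategy: "last_write" | "highest_priority" | "most_specific" | "merge_array";
  tie_breaker?: "observed_at" | "source_priority";
}

interface ReducerConfig {
  merge_policies: Record<string, MergePolicy>;
}

// Documented default for fields with no explicit policy.
const DEFAULT_POLICY: MergePolicy = { strategy: "last_write" };

function policyFor(config: ReducerConfig, field: string): MergePolicy {
  // Own-property check so a field named e.g. "toString" cannot resolve to
  // an inherited Object.prototype member.
  const own = Object.prototype.hasOwnProperty.call(config.merge_policies, field)
    ? config.merge_policies[field]
    : undefined;
  return own ?? DEFAULT_POLICY;
}

const invoiceConfig: ReducerConfig = {
  merge_policies: {
    vendor_name: { strategy: "highest_priority" },
    aliases: { strategy: "merge_array" },
  },
};

// A removed field with historical observations still gets a well-defined policy.
const legacyPolicy = policyFor(invoiceConfig, "legacy_po_number");
```

Here legacy_po_number is an invented field name standing in for a removed field; legacyPolicy.strategy is "last_write" because the field has no entry in merge_policies.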

Source priority ladder

highest_priority leans on the source_priority ladder set on each observation: 0 for AI interpretations, 100 for structured agent writes (store_structured), 1000 for explicit user corrections via the correct() path. This is what guarantees user corrections always win without requiring the reducer to know what 'a correction' is: corrections are just observations at priority 1000.
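To make the ladder concrete, the following sketch picks a vendor name from three observations. The numeric values are the ones quoted above; the key names in SOURCE_PRIORITY, the observation shape, the sample values, and pickHighestPriority are all hypothetical.

```typescript
// Ladder values quoted from the text; the key names are illustrative labels.
const SOURCE_PRIORITY = {
  ai_interpretation: 0,
  structured_agent_write: 100, // store_structured
  user_correction: 1000, // correct() path
} as const;

interface PrioritisedObservation {
  value: string;
  observed_at: string;
  source_priority: number;
}

// Compares numbers only: the function has no notion of "a correction".
function pickHighestPriority(
  obs: PrioritisedObservation[],
): PrioritisedObservation | undefined {
  if (obs.length === 0) return undefined;
  return obs.reduce((best, o) => (o.source_priority > best.source_priority ? o : best));
}

const vendorName = pickHighestPriority([
  {
    value: "Acme Inc",
    observed_at: "2024-03-01T00:00:00Z",
    source_priority: SOURCE_PRIORITY.ai_interpretation,
  },
  {
    value: "Acme Incorporated",
    observed_at: "2024-01-15T00:00:00Z",
    source_priority: SOURCE_PRIORITY.user_correction,
  },
  {
    value: "ACME",
    observed_at: "2024-02-10T00:00:00Z",
    source_priority: SOURCE_PRIORITY.structured_agent_write,
  },
]);
```

The winner is "Acme Incorporated" even though it is the oldest of the three observations, because recency plays no part in highest_priority unless priorities tie.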

Invariants

MUST

  • Be declarative: strategy + optional tie_breaker, no inline code
  • Be deterministic: same observations + same policies ⇒ same snapshot
  • Cover identity-shaped fields with highest_priority so corrections override AI
  • Use merge_array for accumulating sets (aliases, tags, line items, …)
  • Resolve ties with the documented tie_breaker, never randomly

MUST NOT

  • Run arbitrary code: merge logic lives in the reducer, not in policies
  • Use confidence as a merge signal: confidence is advisory only
  • Change a field's strategy between schema versions without a major version bump
  • Override schema-projection filtering: removed fields drop out of snapshots regardless of policy
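Some of these invariants can be checked mechanically when a schema is registered. The validator below is a sketch of that idea, not a documented part of the system: validateMergePolicies and its error messages are hypothetical. It accepts only the two declarative keys and the documented strategy and tie-breaker names, so a policy carrying inline code or a "confidence" strategy is rejected.

```typescript
const STRATEGIES: string[] = ["last_write", "highest_priority", "most_specific", "merge_array"];
const TIE_BREAKERS: string[] = ["observed_at", "source_priority"];

// Returns a list of human-readable problems; an empty list means the
// policies are purely declarative and use only documented names.
function validateMergePolicies(policies: Record<string, unknown>): string[] {
  const errors: string[] = [];
  for (const field of Object.keys(policies)) {
    const raw = policies[field];
    if (typeof raw !== "object" || raw === null || Array.isArray(raw)) {
      errors.push(`${field}: policy must be an object`);
      continue;
    }
    const policy = raw as Record<string, unknown>;
    // Anything beyond strategy and tie_breaker is treated as non-declarative.
    for (const key of Object.keys(policy)) {
      if (key !== "strategy" && key !== "tie_breaker") {
        errors.push(`${field}: unexpected key "${key}"`);
      }
    }
    const strategy = policy["strategy"];
    if (typeof strategy !== "string" || STRATEGIES.indexOf(strategy) === -1) {
      errors.push(`${field}: unknown strategy`);
    }
    const tieBreaker = policy["tie_breaker"];
    if (
      tieBreaker !== undefined &&
      (typeof tieBreaker !== "string" || TIE_BREAKERS.indexOf(tieBreaker) === -1)
    ) {
      errors.push(`${field}: unknown tie_breaker`);
    }
  }
  return errors;
}
```

The version-bump and schema-projection rules depend on state outside a single config object, so they are not covered here.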

Where to go next