Schema registry

The schema registry is the table that holds every versioned entity schema in Neotoma. It is config-driven by design: domain-specific schemas (contact, invoice, task, …) live as data, not code, so schemas can evolve at runtime without redeploys. Every schema row pairs a field-by-field schema_definition with a reducer_config that controls how observations merge into the entity snapshot.

The registry is read on every observation write, every snapshot recomputation, and every schema-projection filter. It sits between the storage layer (sources/observations) and the deterministic reducer.

Schema#

schema_registry table (Postgres / hosted)

CREATE TABLE schema_registry (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  entity_type TEXT NOT NULL,
  schema_version TEXT NOT NULL,
  schema_definition JSONB NOT NULL,
  reducer_config JSONB NOT NULL,
  active BOOLEAN DEFAULT true,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  user_id UUID REFERENCES auth.users(id),
  scope TEXT DEFAULT 'global' CHECK (scope IN ('global', 'user')),
  UNIQUE(entity_type, schema_version)
);
| Field | Type | Purpose |
| --- | --- | --- |
| entity_type | TEXT | Domain type label (contact, invoice, task, conversation_message, …) |
| schema_version | TEXT | Semantic version (1.0.0, 1.1.0, 2.0.0); unique per entity_type |
| schema_definition | JSONB | Field map: name → { type, required?, validator?, converters?, description? } |
| reducer_config | JSONB | Per-field merge_policies the reducer uses to compose observations into snapshots |
| active | BOOLEAN | Exactly one active row per entity_type (per scope) at a time; new writes pick this up immediately |
| scope | TEXT | global (shared) or user (per-user override that wins when the caller's user_id matches) |
| user_id | UUID | Set when scope = 'user'; lets one tenant evolve their schema without affecting others |

Schema definition format#

schema_definition is a JSONB object with a single fields key. Each field carries a type (string | number | date | boolean | array | object), an optional required flag, an optional validator function name, an optional preserveCase flag for canonicalization, an optional description, and an optional converters list for deterministic type coercion (e.g. nanosecond timestamp → ISO 8601 date). The shape is intentionally narrow: schemas describe data; they do not run code.
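As a concrete illustration, a minimal schema_definition for a contact entity might look like the following (the field names and validator name are illustrative, not taken from a real registry row):

```typescript
// Hypothetical schema_definition payload for a "contact" entity.
// A single "fields" key maps each field name to a declarative descriptor.
const contactSchemaDefinition = {
  fields: {
    name:     { type: "string", required: true, preserveCase: true },
    email:    { type: "string", validator: "email", description: "Primary email" },
    birthday: { type: "date", converters: ["timestamp_nanos_to_iso"] },
    tags:     { type: "array" },
  },
};

console.log(Object.keys(contactSchemaDefinition.fields).length); // 4 declared fields
```

Everything in the descriptor is data: validators and converters are referenced by name and resolved against registries at runtime, which is what keeps schemas evolvable without redeploys.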

Field type converters#

Converters reconcile real-world data (numeric timestamps, stringified booleans, nested arrays) with the declared field type without losing the original value. A converter is one of a small registry of named, deterministic functions (timestamp_nanos_to_iso, string_to_number, …). Successful conversions land in observations under the schema-typed field; the original value is mirrored into raw_fragments with reason converted_value_original so reprocessing remains lossless.
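A minimal sketch of how such a registry and the lossless mirroring could fit together (the converter names come from the text above; the implementations and the applyConverter helper are assumptions):

```typescript
// Sketch of a named, deterministic converter registry.
type Converter = (value: unknown) => unknown;

const CONVERTER_REGISTRY: Record<string, Converter> = {
  // Nanosecond epoch timestamp -> ISO 8601 string (nanos -> millis first).
  timestamp_nanos_to_iso: (v) => new Date(Number(v) / 1_000_000).toISOString(),
  // "42" -> 42; throws on non-numeric input so failures surface deterministically.
  string_to_number: (v) => {
    const n = Number(v);
    if (Number.isNaN(n)) throw new Error(`not a number: ${v}`);
    return n;
  },
};

// On success the converted value fills the schema-typed field; the original
// is mirrored into raw_fragments with reason "converted_value_original".
function applyConverter(name: string, value: unknown) {
  const fn = CONVERTER_REGISTRY[name];
  if (!fn) throw new Error(`unknown converter: ${name}`);
  return { converted: fn(value), original: value, reason: "converted_value_original" };
}
```

Because the original value always travels alongside the converted one, a later reprocessing pass can re-run extraction against the raw input even if a converter's behavior is ever corrected.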

Global vs user-specific schemas#

Schemas resolve user-specific first, global second. A user-specific schema row (scope = 'user', user_id = caller) lets a tenant pilot new fields or stricter validators without affecting other users. When a user-specific pattern proves out across many users with consistent types, it can be promoted to a global schema via reconciliation.
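The user-first resolution order can be sketched over the row shape from the table above (the function name and in-memory filtering are illustrative; the real lookup presumably happens in SQL):

```typescript
// Rows mirror the schema_registry columns relevant to resolution.
interface SchemaRow {
  entity_type: string;
  schema_version: string;
  scope: "global" | "user";
  user_id?: string;
  active: boolean;
}

function resolveActiveSchema(
  rows: SchemaRow[],
  entityType: string,
  userId: string,
): SchemaRow | undefined {
  const active = rows.filter((r) => r.entity_type === entityType && r.active);
  return (
    // A user-specific override wins when the caller's user_id matches...
    active.find((r) => r.scope === "user" && r.user_id === userId) ??
    // ...otherwise fall back to the shared global schema.
    active.find((r) => r.scope === "global")
  );
}
```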

Auto-enhancement from raw_fragments#

Unknown fields encountered at extraction time go to raw_fragments. With auto-enhancement enabled, the system analyses fragment frequency, type consistency, and source diversity, then promotes high-confidence fields (≥95% type consistency, ≥2 sources, ≥3 occurrences by default) into the active schema as a minor version bump. Field blacklists, name validators, and idempotency guards keep noise out.
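The promotion gate can be sketched as a pure predicate over per-field fragment statistics (the thresholds come from the text above; the FragmentStats shape, blacklist contents, and name pattern are assumptions):

```typescript
// Per-field statistics gathered from raw_fragments (shape is illustrative).
interface FragmentStats {
  field: string;
  occurrences: number;     // fragments carrying this field
  distinctSources: number; // source diversity
  typeConsistency: number; // share agreeing on a single type, 0..1
}

// Default thresholds from the auto-enhancement description.
const DEFAULTS = { minConsistency: 0.95, minSources: 2, minOccurrences: 3 };
const FIELD_BLACKLIST = new Set(["id", "_raw"]); // illustrative blacklist

function isPromotable(s: FragmentStats, cfg = DEFAULTS): boolean {
  if (FIELD_BLACKLIST.has(s.field)) return false;
  if (!/^[a-z][a-z0-9_]*$/.test(s.field)) return false; // name validator (assumed pattern)
  return (
    s.typeConsistency >= cfg.minConsistency &&
    s.distinctSources >= cfg.minSources &&
    s.occurrences >= cfg.minOccurrences
  );
}
```

Fields that pass the gate would then be appended to the active schema_definition under a minor version bump, with an idempotency check so a re-run does not add the same field twice.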

Service interface#

register() inserts a new (entity_type, schema_version) row. activate() flips active = true on one version and false on the others within the same scope. updateSchemaIncremental() is the safe upgrade path: pass fields_to_add and/or fields_to_remove, optionally bump the version, optionally migrate historical raw_fragments. loadActiveSchema() is the read used by ingestion and the reducer.
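A hypothetical call sequence against that interface, registering and activating a new minor version (only the method names come from the text; the signatures and payloads are assumptions):

```typescript
// Assumed structural type for the registry service; real signatures may differ.
interface SchemaRegistryService {
  register(row: {
    entity_type: string;
    schema_version: string;
    schema_definition: object;
    reducer_config: object;
  }): Promise<void>;
  activate(entityType: string, version: string): Promise<void>;
  loadActiveSchema(entityType: string): Promise<{ schema_version: string }>;
}

async function evolveContactSchema(registry: SchemaRegistryService) {
  await registry.register({
    entity_type: "contact",
    schema_version: "1.1.0",
    schema_definition: { fields: { phone: { type: "string" } } },
    reducer_config: { merge_policies: { phone: "last_write_wins" } },
  });
  await registry.activate("contact", "1.1.0"); // flips active within the same scope
  return registry.loadActiveSchema("contact"); // the read used by ingestion/reducer
}
```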

Invariants#

MUST

  • Carry a non-null entity_type, schema_version, schema_definition, and reducer_config
  • Have at most one active row per (entity_type, scope, user_id) combination
  • Be referenced by every observation via observation.schema_version (immutable on observations)
  • Be the single source of truth for both validation and reducer merge policies
  • Validate every converter against CONVERTER_REGISTRY before registration

MUST NOT

  • Mutate schema_definition or reducer_config in place; register a new schema_version instead
  • Allow more than one active version per entity_type within the same scope
  • Carry merge logic (that lives in the reducer); schemas carry only declarative merge_policies
  • Be edited from outside the schema registry service
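The single-active invariant can also be enforced by Postgres itself rather than only in service code; a sketch using a partial unique index (the index name and the NULL-placeholder UUID are illustrative):

-- At most one active schema version per (entity_type, scope, user_id).
-- COALESCE folds NULL user_id on global rows into a sentinel value,
-- since NULLs never collide with each other in a unique index.
CREATE UNIQUE INDEX one_active_schema_per_scope
  ON schema_registry (entity_type, scope,
      COALESCE(user_id, '00000000-0000-0000-0000-000000000000'::uuid))
  WHERE active;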

Where to go next#