Schema registry

The schema registry is the table that holds every versioned entity schema in Neotoma. It is config-driven by design: domain-specific schemas (contact, invoice, task, …) live as data, not code, so schemas can evolve at runtime without redeploys. Every schema row pairs a field-by-field schema_definition with a reducer_config that controls how observations merge into the entity snapshot.

The registry is read on every observation write, every snapshot recomputation, and every schema-projection filter. It sits between the storage layer (sources/observations) and the deterministic reducer.

Schema#

schema_registry table (Postgres / hosted)

CREATE TABLE schema_registry (
  id UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  entity_type TEXT NOT NULL,
  schema_version TEXT NOT NULL,
  schema_definition JSONB NOT NULL,
  reducer_config JSONB NOT NULL,
  active BOOLEAN DEFAULT true,
  created_at TIMESTAMPTZ DEFAULT NOW(),
  user_id UUID REFERENCES auth.users(id),
  scope TEXT DEFAULT 'global' CHECK (scope IN ('global', 'user')),
  UNIQUE(entity_type, schema_version)
);
| Field | Type | Purpose |
| --- | --- | --- |
| entity_type | TEXT | Domain type label (contact, invoice, task, conversation_message, …) |
| schema_version | TEXT | Semantic version (1.0.0, 1.1.0, 2.0.0); unique per entity_type |
| schema_definition | JSONB | Field map: name → { type, required?, validator?, converters?, description? } |
| reducer_config | JSONB | Per-field merge_policies the reducer uses to compose observations into snapshots |
| active | BOOLEAN | Exactly one active row per entity_type (per scope) at a time; new writes pick this up immediately |
| scope | TEXT | global (shared) or user (per-user override that wins when the caller's user_id matches) |
| user_id | UUID | Set when scope = 'user'; lets one tenant evolve their schema without affecting others |

Schema definition format#

schema_definition is a JSONB object with a single fields key. Each field carries a type (string | number | date | boolean | array | object), an optional required flag, an optional validator function name, an optional preserveCase flag for canonicalization, an optional description, and an optional converters list for deterministic type coercion (e.g. nanosecond timestamp → ISO 8601 date). The shape is intentionally narrow: schemas describe data; they do not run code.
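As a concrete illustration, a minimal schema_definition for a contact entity might look like the following (the field names and validator name are illustrative, not taken from a real registry row):

```typescript
// Hypothetical schema_definition payload for a "contact" entity.
// A single "fields" key maps each field name to a declarative descriptor.
const contactSchemaDefinition = {
  fields: {
    name:     { type: "string", required: true, preserveCase: true },
    email:    { type: "string", validator: "email", description: "Primary email" },
    birthday: { type: "date", converters: ["timestamp_nanos_to_iso"] },
    tags:     { type: "array" },
  },
};

console.log(Object.keys(contactSchemaDefinition.fields).length); // 4 declared fields
```

Everything in the descriptor is data: validators and converters are referenced by name and resolved against registries at runtime, which is what keeps schemas evolvable without redeploys.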

Field type converters#

Converters reconcile real-world data (numeric timestamps, stringified booleans, nested arrays) with the declared field type without losing the original value. A converter is one of a small registry of named, deterministic functions (timestamp_nanos_to_iso, string_to_number, …). Successful conversions land in observations under the schema-typed field; the original value is mirrored into raw_fragments with reason converted_value_original so reprocessing remains lossless.
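A minimal sketch of how such a registry and the lossless mirroring could fit together (the converter names come from the text above; the implementations and the applyConverter helper are assumptions):

```typescript
// Sketch of a named, deterministic converter registry.
type Converter = (value: unknown) => unknown;

const CONVERTER_REGISTRY: Record<string, Converter> = {
  // Nanosecond epoch timestamp -> ISO 8601 string (nanos -> millis first).
  timestamp_nanos_to_iso: (v) => new Date(Number(v) / 1_000_000).toISOString(),
  // "42" -> 42; throws on non-numeric input so failures surface deterministically.
  string_to_number: (v) => {
    const n = Number(v);
    if (Number.isNaN(n)) throw new Error(`not a number: ${v}`);
    return n;
  },
};

// On success the converted value fills the schema-typed field; the original
// is mirrored into raw_fragments with reason "converted_value_original".
function applyConverter(name: string, value: unknown) {
  const fn = CONVERTER_REGISTRY[name];
  if (!fn) throw new Error(`unknown converter: ${name}`);
  return { converted: fn(value), original: value, reason: "converted_value_original" };
}
```

Because the original value always travels alongside the converted one, a later reprocessing pass can re-run extraction against the raw input even if a converter's behavior is ever corrected.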

Global vs user-specific schemas#

Schemas resolve user-specific first, global second. A user-specific schema row (scope = 'user', user_id = caller) lets a tenant pilot new fields or stricter validators without affecting other users. When a user-specific pattern proves out across many users with consistent types, it can be promoted to a global schema via reconciliation.
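The user-first resolution order can be sketched over the row shape from the table above (the function name and in-memory filtering are illustrative; the real lookup presumably happens in SQL):

```typescript
// Rows mirror the schema_registry columns relevant to resolution.
interface SchemaRow {
  entity_type: string;
  schema_version: string;
  scope: "global" | "user";
  user_id?: string;
  active: boolean;
}

function resolveActiveSchema(
  rows: SchemaRow[],
  entityType: string,
  userId: string,
): SchemaRow | undefined {
  const active = rows.filter((r) => r.entity_type === entityType && r.active);
  return (
    // A user-specific override wins when the caller's user_id matches...
    active.find((r) => r.scope === "user" && r.user_id === userId) ??
    // ...otherwise fall back to the shared global schema.
    active.find((r) => r.scope === "global")
  );
}
```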

Auto-enhancement from raw_fragments#

Unknown fields encountered at extraction time go to raw_fragments. With auto-enhancement enabled, the system analyses fragment frequency, type consistency, and source diversity, then promotes high-confidence fields (≥95% type consistency, ≥2 sources, ≥3 occurrences by default) into the active schema as a minor version bump. Field blacklists, name validators, and idempotency guards keep noise out.
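The promotion gate can be sketched as a pure predicate over per-field fragment statistics (the thresholds come from the text above; the FragmentStats shape, blacklist contents, and name pattern are assumptions):

```typescript
// Per-field statistics gathered from raw_fragments (shape is illustrative).
interface FragmentStats {
  field: string;
  occurrences: number;     // fragments carrying this field
  distinctSources: number; // source diversity
  typeConsistency: number; // share agreeing on a single type, 0..1
}

// Default thresholds from the auto-enhancement description.
const DEFAULTS = { minConsistency: 0.95, minSources: 2, minOccurrences: 3 };
const FIELD_BLACKLIST = new Set(["id", "_raw"]); // illustrative blacklist

function isPromotable(s: FragmentStats, cfg = DEFAULTS): boolean {
  if (FIELD_BLACKLIST.has(s.field)) return false;
  if (!/^[a-z][a-z0-9_]*$/.test(s.field)) return false; // name validator (assumed pattern)
  return (
    s.typeConsistency >= cfg.minConsistency &&
    s.distinctSources >= cfg.minSources &&
    s.occurrences >= cfg.minOccurrences
  );
}
```

Fields that pass the gate would then be appended to the active schema_definition under a minor version bump, with an idempotency check so a re-run does not add the same field twice.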

Service interface#

register() inserts a new (entity_type, schema_version) row. activate() flips active = true on one version and false on the others within the same scope. updateSchemaIncremental() is the safe upgrade path: pass fields_to_add and/or fields_to_remove, optionally bump the version, optionally migrate historical raw_fragments. loadActiveSchema() is the read used by ingestion and the reducer.
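A hypothetical call sequence against that interface, registering and activating a new minor version (only the method names come from the text; the signatures and payloads are assumptions):

```typescript
// Assumed structural type for the registry service; real signatures may differ.
interface SchemaRegistryService {
  register(row: {
    entity_type: string;
    schema_version: string;
    schema_definition: object;
    reducer_config: object;
  }): Promise<void>;
  activate(entityType: string, version: string): Promise<void>;
  loadActiveSchema(entityType: string): Promise<{ schema_version: string }>;
}

async function evolveContactSchema(registry: SchemaRegistryService) {
  await registry.register({
    entity_type: "contact",
    schema_version: "1.1.0",
    schema_definition: { fields: { phone: { type: "string" } } },
    reducer_config: { merge_policies: { phone: "last_write_wins" } },
  });
  await registry.activate("contact", "1.1.0"); // flips active within the same scope
  return registry.loadActiveSchema("contact"); // the read used by ingestion/reducer
}
```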

Invariants#

MUST

  • Carry a non-null entity_type, schema_version, schema_definition, and reducer_config
  • Have at most one active row per (entity_type, scope, user_id) combination
  • Be referenced by every observation via observation.schema_version (immutable on observations)
  • Be the single source of truth for both validation and reducer merge policies
  • Validate every converter against CONVERTER_REGISTRY before registration

MUST NOT

  • Mutate schema_definition or reducer_config in place; register a new schema_version instead
  • Allow more than one active version per entity_type within the same scope
  • Carry merge logic (that lives in the reducer); schemas carry only declarative merge_policies
  • Be edited from outside the schema registry service
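The single-active invariant can also be enforced by Postgres itself rather than only in service code; a sketch using a partial unique index (the index name and the NULL-placeholder UUID are illustrative):

-- At most one active schema version per (entity_type, scope, user_id).
-- COALESCE folds NULL user_id on global rows into a sentinel value,
-- since NULLs never collide with each other in a unique index.
CREATE UNIQUE INDEX one_active_schema_per_scope
  ON schema_registry (entity_type, scope,
      COALESCE(user_id, '00000000-0000-0000-0000-000000000000'::uuid))
  WHERE active;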

Where to go next#