Storage layers
Neotoma uses a three-layer storage model so users can upload anything without losing data, while the queryable state layer stays schema-compliant and deterministic. Every extraction touches all three layers: the original bytes go to raw_text, schema-defined fields go to observation properties, and everything else goes to raw_fragments; nothing is silently dropped.
This model spans the boundary between sources and observations: ingestion partitions every extraction into the three layers before writing.
Schema#
Three-layer extraction shape (TypeScript)
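A minimal sketch of that shape in TypeScript. Only raw_text, properties, schema_version, raw_fragments, and the converted_value_original reason are documented above; the other field names and the sample values are illustrative assumptions.

```typescript
// Hypothetical shape of one extraction result; names beyond the
// documented raw_text / properties / raw_fragments are assumptions.
interface SourceLayer {
  content_hash: string;      // SHA-256 of the raw bytes
  user_id: string;
  raw_text: string;          // immutable original bytes, never interpreted
}

type FragmentReason =
  | "unknown_field"
  | "converted_value_original"
  | "validation_warning"
  | "extraction_quality";

interface RawFragment {
  field: string;
  value: unknown;
  reason: FragmentReason;
}

interface ObservationProperties {
  schema_version: string;    // stamped on every payload
  [field: string]: unknown;  // only schema-defined fields land here
}

interface ExtractionResult {
  source: SourceLayer;
  properties: ObservationProperties;
  raw_fragments: RawFragment[];
  warnings: string[];        // e.g. missing required fields
}

// Illustrative instance: one schema field, one preserved unknown field.
const example: ExtractionResult = {
  source: { content_hash: "abc123", user_id: "u1", raw_text: "Alice, chess" },
  properties: { schema_version: "2.0.0", name: "Alice" },
  raw_fragments: [{ field: "hobby", value: "chess", reason: "unknown_field" }],
  warnings: [],
};
```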
Layer 1, raw_text on the source#
The source's raw bytes are immutable and content-addressed (SHA-256 + user_id). They never change after upload, never carry interpreted data, and are the artifact every reinterpretation reads from. Schema evolution does not require re-uploading: the same source can be reinterpreted under a newer schema version at any time.
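Content addressing can be sketched as follows, using Node's crypto module. The exact key format Neotoma uses (here, user_id prefixed onto the hex digest) is an assumption; the document only states that addressing combines SHA-256 with user_id.

```typescript
import { createHash } from "node:crypto";

// Hypothetical key derivation: same bytes + same user always yield the
// same key, so re-uploads are idempotent and the stored bytes never change.
function sourceKey(rawBytes: Buffer, userId: string): string {
  const digest = createHash("sha256").update(rawBytes).digest("hex");
  return `${userId}:${digest}`; // illustrative format, not Neotoma's actual one
}
```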
Layer 2, observation.properties (schema-compliant)#
Only fields defined in the active schema_definition land in properties. Each properties payload includes schema_version. This is the layer queries hit (JSONB indexed), the layer entity extraction reads from, and the layer the reducer composes into snapshots. By construction it is deterministic: same input bytes + same schema_version + same converters ⇒ same properties.
Layer 3, raw_fragments (preservation)#
Anything extracted that doesn't match the active schema goes to raw_fragments: unknown fields, original values that were converted, validation warnings, and extraction quality metrics. raw_fragments is the substrate that auto-enhancement and the schema expansion architecture analyze to suggest schema upgrades. Because nothing is dropped, schema evolution is non-destructive: re-adding a field surfaces its historical values via schema-projection filtering.
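The non-destructive-evolution claim can be illustrated with a small sketch: once a field is re-added to the schema, its historical values are recoverable by filtering raw_fragments on the field name. The function name and fragment shape are assumptions for illustration, not Neotoma's actual API.

```typescript
interface Fragment {
  field: string;
  value: unknown;
  reason: string; // e.g. "unknown_field", "converted_value_original"
}

// Hypothetical schema-projection filter: collect every preserved value
// ever extracted for a field, across all of its fragments.
function historicalValues(fragments: Fragment[], field: string): unknown[] {
  return fragments.filter((f) => f.field === field).map((f) => f.value);
}
```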
Partition rules#
Fields named in the active schema go to properties; fields not named in the schema go to raw_fragments as unknown fields. Missing required fields produce warnings on the observation, not a write rejection; observations are never rejected for missing optional fields or for unknown fields. The system always preserves what was extracted.
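The partition rules above can be sketched as a single pass over the extracted fields. The SchemaDefinition shape and function names are illustrative assumptions; the rules themselves (schema fields to properties, unknowns to raw_fragments, missing required fields warn but never reject) come from the text.

```typescript
// Illustrative schema description, not Neotoma's actual schema_definition.
interface SchemaDefinition {
  version: string;
  required: string[];
  optional: string[];
}

interface PartitionResult {
  properties: Record<string, unknown>;
  raw_fragments: { field: string; value: unknown; reason: string }[];
  warnings: string[];
}

function partition(
  extracted: Record<string, unknown>,
  schema: SchemaDefinition,
): PartitionResult {
  const known = new Set([...schema.required, ...schema.optional]);
  const properties: Record<string, unknown> = {
    schema_version: schema.version, // stamped on every payload
  };
  const raw_fragments: PartitionResult["raw_fragments"] = [];
  const warnings: string[] = [];

  for (const [field, value] of Object.entries(extracted)) {
    if (known.has(field)) {
      properties[field] = value; // schema-defined → properties
    } else {
      // Unknown fields are preserved, never dropped or rejected.
      raw_fragments.push({ field, value, reason: "unknown_field" });
    }
  }

  // Missing required fields warn; the observation is still written.
  for (const field of schema.required) {
    if (!(field in extracted)) {
      warnings.push(`missing required field: ${field}`);
    }
  }

  return { properties, raw_fragments, warnings };
}
```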
What entity / snapshot composition reads#
Entity extraction and snapshot computation read from properties only. raw_fragments is explicitly excluded from snapshot composition: it is a holding area, not a query target. This is what keeps snapshots deterministic and schema-aligned even when extraction surfaces extra data.
Invariants#
MUST
- Preserve all extracted data, unknown fields go to raw_fragments, never discarded
- Always create an observation, even on missing-required-field warnings
- Stamp schema_version on every properties payload
- Pull entity extraction and snapshot composition fields from properties only
- Mirror converter inputs into raw_fragments with reason converted_value_original
MUST NOT
- Reject observations for unknown fields
- Reject observations for missing optional fields
- Store unknown or non-schema fields in observation.properties
- Use raw_fragments for entity extraction or snapshot composition
- Modify or guess field values during partitioning
Related#
- Schema handling architecture: three-layer model, partition logic, validation rules
- Schema registry: where the active schema_definition lives
- Observations: how properties on observations feed the reducer
- Sources: layer 1, content-addressed raw bytes
- Schema expansion: how raw_fragments seed automatic schema growth
Where to go next#
- All schema concepts: registry, merge policies, storage layers, versioning
- Primitive record types: sources, observations, snapshots, and the rest of Neotoma's atoms
- Schema management workflows: CLI commands for listing, validating, and evolving schemas