Neotoma

Storage layers

Neotoma uses a three-layer storage model so users can upload anything without losing data, while the queryable state layer stays schema-compliant and deterministic. Every extraction touches all three layers: the original bytes go to raw_text, schema-defined fields go to observation properties, and everything else goes to raw_fragments rather than being silently dropped.

The model spans the boundary between sources and observations: ingestion partitions every extraction into these three layers before writing.

Schema

Three-layer extraction shape (TypeScript)

// Returned by extractAndValidate()
{
  // Layer 1: raw_text, immutable original bytes, lives on the source
  // (already stored on the sources row by the time extraction runs)

  // Layer 2: properties, schema-compliant only, deterministic, queryable
  properties: {
    schema_version: "1.0",
    invoice_number: "INV-001",
    amount: 1500.0,
    currency: "USD",
    date_issued: "2024-01-15T00:00:00Z",
    vendor_name: "Acme Corp"
  },

  // Layer 3: extraction_metadata, preservation layer
  extraction_metadata: {
    unknown_fields: {
      purchase_order: "PO-789",
      internal_cost_center: "CC-456"
    },
    warnings: [
      {
        type: "unknown_field",
        field: "purchase_order",
        message: "Field not defined for type 'invoice', preserved in extraction_metadata"
      }
    ],
    extraction_quality: {
      fields_extracted_count: 7,
      fields_filtered_count: 2,
      matched_patterns: ["invoice_number_pattern", "amount_due_pattern"]
    }
  }
}

Layer 1: raw_text on the source

The source's raw bytes are immutable and content-addressed (SHA-256 + user_id). They never change after upload, never carry interpreted data, and are the artifact every reinterpretation reads from. Schema evolution does not require re-uploading: the same source can be reinterpreted under a newer schema version at any time.
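As a rough illustration of content addressing, the sketch below derives a stable key from the uploading user and the SHA-256 digest of the bytes. The helper name contentAddress and the key format (user id, colon, hex digest) are assumptions made for this example; the text above only states that SHA-256 and user_id are involved.

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: the "userId:hexDigest" key format is an assumption,
// not Neotoma's documented layout.
function contentAddress(userId: string, bytes: Uint8Array): string {
  const digest = createHash("sha256").update(bytes).digest("hex");
  return `${userId}:${digest}`;
}

const bytes = new TextEncoder().encode("abc");
const a = contentAddress("user-1", bytes);
const b = contentAddress("user-1", bytes);
const c = contentAddress("user-2", bytes);

console.log(a === b); // true: same user and same bytes give the same address
console.log(a === c); // false: the address is scoped to the user
```

Because the key depends only on the user and the bytes, re-uploading identical content cannot produce a second, divergent source.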

Layer 2: observation.properties (schema-compliant)

Only fields defined in the active schema_definition land in properties. Each properties payload includes schema_version. This is the layer that queries hit (JSONB-indexed), the layer entity extraction reads from, and the layer the reducer composes into snapshots. By construction it is deterministic: same input bytes + same schema_version + same converters ⇒ same properties.
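The determinism claim can be illustrated with pure converters and a fixed key order. This is a minimal sketch: the converters table, the buildProperties name, and the two converter implementations are hypothetical, and schema filtering is left out to keep the focus on repeatability.

```typescript
type Converter = (raw: string) => string | number;

// Hypothetical converters for the invoice example; each is a pure function.
const converters: Record<string, Converter> = {
  amount: (raw) => Number(raw.replace(/[^0-9.-]/g, "")),
  date_issued: (raw) =>
    new Date(`${raw}T00:00:00Z`).toISOString().replace(".000Z", "Z"),
};

function buildProperties(
  schemaVersion: string,
  extracted: Record<string, string>,
): Record<string, string | number> {
  const properties: Record<string, string | number> = {
    schema_version: schemaVersion,
  };
  // Visit keys in sorted order so the result does not depend on extraction order.
  for (const key of Object.keys(extracted).sort()) {
    const convert = converters[key];
    properties[key] = convert ? convert(extracted[key]) : extracted[key];
  }
  return properties;
}

const first = buildProperties("1.0", { date_issued: "2024-01-15", amount: "$1,500.00" });
const second = buildProperties("1.0", { amount: "$1,500.00", date_issued: "2024-01-15" });

console.log(JSON.stringify(first) === JSON.stringify(second)); // true
```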

Layer 3: raw_fragments (preservation)

Anything extracted that doesn't match the active schema goes to raw_fragments: unknown fields, original values that were converted, validation warnings, and extraction quality metrics. raw_fragments is the substrate that auto-enhancement and the schema expansion architecture analyse to suggest schema upgrades. Because nothing is dropped, schema evolution is non-destructive: re-adding a field surfaces its historical values via schema-projection filtering.
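One way to picture schema-projection filtering is as a filter over preserved fragments using the field set of a newer schema. The RawFragment shape, the "unknown_field" reason string, and the projectFragments helper are assumptions for illustration; only the converted_value_original reason is named elsewhere on this page.

```typescript
// Hypothetical fragment shape; the real raw_fragments layout may differ.
interface RawFragment {
  field: string;
  value: unknown;
  reason: string; // e.g. "unknown_field" or "converted_value_original"
}

// Surface preserved values for fields that the given schema now defines.
function projectFragments(
  fragments: RawFragment[],
  schemaFields: Set<string>,
): Record<string, unknown> {
  const surfaced: Record<string, unknown> = {};
  for (const fragment of fragments) {
    if (fragment.reason === "unknown_field" && schemaFields.has(fragment.field)) {
      surfaced[fragment.field] = fragment.value;
    }
  }
  return surfaced;
}

const fragments: RawFragment[] = [
  { field: "purchase_order", value: "PO-789", reason: "unknown_field" },
  { field: "internal_cost_center", value: "CC-456", reason: "unknown_field" },
];

// A later schema version that adds purchase_order (illustrative).
const newerSchemaFields = new Set(["invoice_number", "amount", "purchase_order"]);
const surfaced = projectFragments(fragments, newerSchemaFields);

console.log(surfaced); // { purchase_order: "PO-789" }
```

The historical value becomes visible without re-uploading or re-extracting anything, which is what makes the schema change non-destructive.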

Partition rules

Fields named in the active schema → properties. Fields not named in the schema → raw_fragments as unknown fields. A missing required field produces a warning on the observation, not a write rejection. Observations are never rejected for missing optional fields or for unknown fields: the system always preserves what was extracted.
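The rules above can be sketched as a single function that always returns a result and never throws. The output shape mirrors the extraction sample earlier on this page; the SchemaDefinition type, the partition name, and the "missing_required_field" warning type are assumptions for illustration.

```typescript
interface FieldSpec { required: boolean }
interface SchemaDefinition {
  schema_version: string;
  fields: Record<string, FieldSpec>;
}
interface ExtractionWarning { type: string; field: string; message: string }
interface Extraction {
  properties: Record<string, unknown>;
  extraction_metadata: {
    unknown_fields: Record<string, unknown>;
    warnings: ExtractionWarning[];
  };
}

const has = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

function partition(
  extracted: Record<string, unknown>,
  schema: SchemaDefinition,
): Extraction {
  const properties: Record<string, unknown> = { schema_version: schema.schema_version };
  const unknown_fields: Record<string, unknown> = {};
  const warnings: ExtractionWarning[] = [];

  for (const [field, value] of Object.entries(extracted)) {
    if (has(schema.fields, field)) {
      properties[field] = value; // copied as-is: no modification or guessing
    } else {
      unknown_fields[field] = value; // preserved, never dropped
      warnings.push({ type: "unknown_field", field, message: "Field not defined in schema, preserved" });
    }
  }
  // A missing required field is reported, but the observation is still produced.
  for (const [field, spec] of Object.entries(schema.fields)) {
    if (spec.required && !has(extracted, field)) {
      warnings.push({ type: "missing_required_field", field, message: "Required field missing" });
    }
  }
  return { properties, extraction_metadata: { unknown_fields, warnings } };
}

const invoiceSchema: SchemaDefinition = {
  schema_version: "1.0",
  fields: {
    invoice_number: { required: true },
    amount: { required: true },
    vendor_name: { required: false },
  },
};
const result = partition({ invoice_number: "INV-001", purchase_order: "PO-789" }, invoiceSchema);

console.log(result.properties); // schema_version and invoice_number only
console.log(result.extraction_metadata.warnings.map((w) => w.type));
```

In this run the missing optional vendor_name produces no warning, the missing required amount produces one, and purchase_order is preserved outside properties.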

What entity / snapshot composition reads

Entity extraction and snapshot computation read from properties only. raw_fragments is explicitly excluded from snapshot composition: it is a holding area, not a query target. This is what keeps snapshots deterministic and schema-aligned even when extraction surfaces extra data.
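A minimal sketch of a reducer that honours this rule follows. The Observation shape, the observed_at field, and the last-write-wins merge are assumptions; this page does not specify the reducer's actual merge policy, only that it reads properties and ignores raw_fragments.

```typescript
// Hypothetical observation shape for illustration.
interface Observation {
  observed_at: string; // ISO 8601 timestamp in UTC
  properties: Record<string, unknown>;
  raw_fragments: Record<string, unknown>;
}

// Assumed policy: later observations overwrite earlier ones, field by field.
function composeSnapshot(observations: Observation[]): Record<string, unknown> {
  const ordered = [...observations].sort((x, y) =>
    x.observed_at < y.observed_at ? -1 : x.observed_at > y.observed_at ? 1 : 0,
  );
  const snapshot: Record<string, unknown> = {};
  for (const observation of ordered) {
    Object.assign(snapshot, observation.properties); // raw_fragments is never read
  }
  return snapshot;
}

const snapshot = composeSnapshot([
  {
    observed_at: "2024-02-01T00:00:00Z",
    properties: { schema_version: "1.0", amount: 1750.0 },
    raw_fragments: {},
  },
  {
    observed_at: "2024-01-15T00:00:00Z",
    properties: { schema_version: "1.0", invoice_number: "INV-001", amount: 1500.0 },
    raw_fragments: { purchase_order: "PO-789" },
  },
]);

console.log(snapshot);
// { schema_version: "1.0", invoice_number: "INV-001", amount: 1750 }
```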

Invariants

MUST

  • Preserve all extracted data: unknown fields go to raw_fragments and are never discarded
  • Always create an observation, even on missing-required-field warnings
  • Stamp schema_version on every properties payload
  • Pull entity extraction and snapshot composition fields from properties only
  • Mirror converter inputs into raw_fragments with reason converted_value_original

MUST NOT

  • Reject observations for unknown fields
  • Reject observations for missing optional fields
  • Store unknown or non-schema fields in observation.properties
  • Use raw_fragments for entity extraction or snapshot composition
  • Modify or guess field values during partitioning
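Two of these invariants are easy to check mechanically on a stored observation: schema_version is stamped, and properties holds no non-schema fields. The findViolations helper and the StoredObservation shape below are hypothetical, a sketch of what a test-time check could look like rather than part of Neotoma.

```typescript
// Hypothetical shape: only the part of an observation this check needs.
interface StoredObservation {
  properties: Record<string, unknown>;
}

function findViolations(
  observation: StoredObservation,
  schemaFields: string[],
): string[] {
  const violations: string[] = [];
  if (typeof observation.properties["schema_version"] !== "string") {
    violations.push("properties is missing schema_version");
  }
  const allowed = new Set(["schema_version", ...schemaFields]);
  for (const field of Object.keys(observation.properties)) {
    if (!allowed.has(field)) {
      violations.push(`non-schema field in properties: ${field}`);
    }
  }
  return violations;
}

const schemaFields = ["invoice_number", "amount"];
const good = findViolations(
  { properties: { schema_version: "1.0", invoice_number: "INV-001" } },
  schemaFields,
);
const bad = findViolations(
  { properties: { invoice_number: "INV-001", purchase_order: "PO-789" } },
  schemaFields,
);

console.log(good); // []
console.log(bad);  // two violations: missing schema_version, non-schema purchase_order
```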

Where to go next