Storage layers

Neotoma uses a three-layer storage model so that users can upload anything without losing data, while the queryable state layer stays schema-compliant and deterministic. Every extraction touches all three layers: the original bytes go to raw_text, schema-defined fields go to observation properties, and everything else goes to raw_fragments rather than being silently dropped.

The model spans the boundary between sources and observations: ingestion partitions every extraction into these three layers before writing.

Schema#

Three-layer extraction shape (TypeScript)


// Returned by extractAndValidate()
{
  // Layer 1: raw_text, immutable original bytes, lives on the source
  // (already stored on the sources row by the time extraction runs)

  // Layer 2: properties, schema-compliant only, deterministic, queryable
  properties: {
    schema_version: "1.0",
    invoice_number: "INV-001",
    amount: 1500.0,
    currency: "USD",
    date_issued: "2024-01-15T00:00:00Z",
    vendor_name: "Acme Corp"
  },

  // Layer 3: extraction_metadata, preservation layer
  extraction_metadata: {
    unknown_fields: {
      purchase_order: "PO-789",
      internal_cost_center: "CC-456"
    },
    warnings: [
      {
        type: "unknown_field",
        field: "purchase_order",
        message: "Field not defined for type 'invoice', preserved in extraction_metadata"
      }
    ],
    extraction_quality: {
      fields_extracted_count: 7,
      fields_filtered_count: 2,
      matched_patterns: ["invoice_number_pattern", "amount_due_pattern"]
    }
  }
}

Layer 1: raw_text on the source#

The source's raw bytes are immutable and content-addressed (SHA-256 + user_id). They never change after upload, never carry interpreted data, and are the artifact every reinterpretation reads from. Schema evolution does not require re-uploading: the same source can be reinterpreted under a newer schema version at any time.
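As a rough illustration of content addressing, the sketch below hashes the uploaded bytes with SHA-256 and pairs the digest with the owning user. The SourceKey shape and the sourceKey function are hypothetical names used only for this example, and how Neotoma actually combines the hash with user_id is an assumption here.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape: the real sources row is not shown in this section.
interface SourceKey {
  userId: string;
  contentHash: string; // hex-encoded SHA-256 of the raw bytes
}

// Derive a content address from the uploaded bytes and the owning user.
function sourceKey(userId: string, rawBytes: Uint8Array): SourceKey {
  const contentHash = createHash("sha256").update(rawBytes).digest("hex");
  return { userId, contentHash };
}

const bytes = new TextEncoder().encode("Invoice INV-001 from Acme Corp");
const first = sourceKey("user-1", bytes);
const second = sourceKey("user-1", bytes);
const other = sourceKey("user-2", bytes);

// Identical bytes always hash to the same digest, so a repeated upload by the
// same user maps to the same key; a different user gets a distinct key.
console.log(first.contentHash === second.contentHash);
console.log(first.userId === other.userId);
```

Because the key depends only on the bytes and the user, reinterpreting a source under a newer schema can locate the same stored artifact without a new upload.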

Layer 2: observation.properties (schema-compliant)#

Only fields defined in the active schema_definition land in properties. Each properties payload includes schema_version. This is the layer queries hit (JSONB indexed), the layer entity extraction reads from, and the layer the reducer composes into snapshots. By construction it is deterministic: same input bytes + same schema_version + same converters ⇒ same properties.

Layer 3: raw_fragments (preservation)#

Anything extracted that doesn't match the active schema goes to raw_fragments: unknown fields, original values that were converted, validation warnings, and extraction quality metrics. raw_fragments is the substrate that auto-enhancement and the schema expansion architecture analyse to suggest schema upgrades. Because nothing is dropped, schema evolution is non-destructive: re-adding a field surfaces its historical values via schema-projection filtering.

Partition rules#

Fields named in the active schema → properties. Fields not named in the schema → raw_fragments as unknown fields. A missing required field produces a warning on the observation, not a write rejection. Observations are never rejected for missing optional fields or for unknown fields. The system always preserves what was extracted.
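The partition rules can be sketched as a single pass over the extracted fields. This is a minimal illustration, not Neotoma's implementation: the SchemaDefinition shape, the partitionFields function, and the missing_required_field warning type are assumptions made for the example (only unknown_field appears in the extraction shape above).

```typescript
type FieldValue = string | number | boolean | null;

// Hypothetical, simplified stand-in for a schema_definition.
interface SchemaDefinition {
  schemaVersion: string;
  fields: Record<string, { required: boolean }>;
}

interface PartitionWarning {
  type: "unknown_field" | "missing_required_field"; // second name is assumed
  field: string;
}

interface Partitioned {
  properties: Record<string, FieldValue>;
  unknownFields: Record<string, FieldValue>;
  warnings: PartitionWarning[];
}

const has = (obj: object, key: string): boolean =>
  Object.prototype.hasOwnProperty.call(obj, key);

function partitionFields(
  extracted: Record<string, FieldValue>,
  schema: SchemaDefinition
): Partitioned {
  const properties: Record<string, FieldValue> = {
    schema_version: schema.schemaVersion,
  };
  const unknownFields: Record<string, FieldValue> = {};
  const warnings: PartitionWarning[] = [];

  for (const [name, value] of Object.entries(extracted)) {
    if (has(schema.fields, name)) {
      properties[name] = value; // copied as-is, no guessing or modification
    } else {
      unknownFields[name] = value; // preserved, never dropped
      warnings.push({ type: "unknown_field", field: name });
    }
  }

  // Missing required fields only add warnings; nothing is rejected.
  for (const [name, spec] of Object.entries(schema.fields)) {
    if (spec.required && !has(extracted, name)) {
      warnings.push({ type: "missing_required_field", field: name });
    }
  }
  return { properties, unknownFields, warnings };
}

const invoiceSchema: SchemaDefinition = {
  schemaVersion: "1.0",
  fields: {
    invoice_number: { required: true },
    amount: { required: true },
    currency: { required: false },
  },
};

const result = partitionFields(
  { invoice_number: "INV-001", purchase_order: "PO-789" },
  invoiceSchema
);
console.log(result.properties);
console.log(result.unknownFields);
console.log(result.warnings);
```

In this example the missing optional currency field produces no warning, the missing required amount field produces one, and purchase_order is kept as an unknown field. The function never throws, which mirrors the rule that an observation is always created.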

What entity / snapshot composition reads#

Entity extraction and snapshot computation read from properties only. raw_fragments is explicitly excluded from snapshot composition: it is a holding area, not a query target. This is what keeps snapshots deterministic and schema-aligned even when extraction surfaces extra data.
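To make the properties-only rule concrete, here is a small sketch of snapshot composition. The Observation shape, the observedAt field, and the "most recent observation wins" merge policy are all assumptions for illustration; the actual reducer and its merge policies are documented elsewhere.

```typescript
// Hypothetical observation shape for this example only.
interface Observation {
  observedAt: string; // ISO 8601 UTC timestamp (assumed field name)
  properties: Record<string, unknown>;
  rawFragments: Record<string, unknown>; // preserved, but never read below
}

// Assumed merge policy: for each field, the most recent observation wins.
function composeSnapshot(observations: Observation[]): Record<string, unknown> {
  const ordered = [...observations].sort((a, b) =>
    a.observedAt < b.observedAt ? -1 : a.observedAt > b.observedAt ? 1 : 0
  );
  const snapshot: Record<string, unknown> = {};
  for (const obs of ordered) {
    // Only properties contribute; rawFragments is deliberately ignored.
    Object.assign(snapshot, obs.properties);
  }
  return snapshot;
}

const snapshot = composeSnapshot([
  {
    observedAt: "2024-02-01T00:00:00Z",
    properties: { amount: 1600 },
    rawFragments: {},
  },
  {
    observedAt: "2024-01-15T00:00:00Z",
    properties: { vendor_name: "Acme Corp", amount: 1500 },
    rawFragments: { purchase_order: "PO-789" },
  },
]);
console.log(snapshot);
```

Because the function never touches rawFragments, the extra purchase_order value cannot leak into the snapshot regardless of what extraction preserved.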

Invariants#

MUST

Preserve all extracted data: unknown fields go to raw_fragments and are never discarded
Always create an observation, even on missing-required-field warnings
Stamp schema_version on every properties payload
Pull entity extraction and snapshot composition fields from properties only
Mirror converter inputs into raw_fragments with reason converted_value_original

MUST NOT

Reject observations for unknown fields
Reject observations for missing optional fields
Store unknown or non-schema fields in observation.properties
Use raw_fragments for entity extraction or snapshot composition
Modify or guess field values during partitioning
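The converter invariant in the MUST list (mirror converter inputs into raw_fragments with reason converted_value_original) might look like the sketch below. The RawFragment shape and the convertAmount converter are hypothetical; only the reason string comes from the invariant itself, and the input is assumed to be parseable.

```typescript
// Hypothetical fragment shape; only the reason value is taken from the text.
interface RawFragment {
  field: string;
  value: unknown;
  reason: string;
}

interface ConversionResult {
  value: number;
  fragments: RawFragment[];
}

// Example converter (assumed): parses amounts such as "1,500.00" into a number
// and mirrors the untouched original into a fragment.
function convertAmount(field: string, original: string): ConversionResult {
  const value = Number(original.replace(/,/g, ""));
  return {
    value,
    fragments: [{ field, value: original, reason: "converted_value_original" }],
  };
}

const converted = convertAmount("amount", "1,500.00");
console.log(converted.value);
console.log(converted.fragments);
```

Keeping the original string alongside the converted number means a later converter fix can be replayed against the preserved input instead of the already-converted value.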

Schema handling architecture: three-layer model, partition logic, validation rules
Schema registry: where the active schema_definition lives
Observations: how properties on observations feed the reducer
Sources: Layer 1, content-addressed raw bytes
Schema expansion: how raw_fragments seed automatic schema growth

Where to go next#

All schema concepts: registry, merge policies, storage layers, versioning
Primitive record types: sources, observations, snapshots, and the rest of Neotoma's atoms
Schema management workflows: CLI commands for listing, validating, and evolving schemas