Neotoma

Sources

A source is the raw, content-addressed artifact that every other primitive ultimately traces back to: a file you uploaded, a webhook payload, a structured agent write. Sources are deduplicated per user by SHA-256 content hash so the same bytes are never stored twice.

Source → Interpretation → Observation → Snapshot. Sources are the leftmost, immutable foundation of the three-layer truth model.

Schema#

sources table (Postgres / hosted)

```sql
CREATE TABLE sources (
  id              UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  content_hash    TEXT NOT NULL,
  storage_url     TEXT NOT NULL,
  storage_status  TEXT NOT NULL DEFAULT 'uploaded',
  mime_type       TEXT NOT NULL,
  file_name       TEXT,
  byte_size       INTEGER NOT NULL,
  source_type     TEXT NOT NULL,
  source_agent_id TEXT,
  source_metadata JSONB DEFAULT '{}',
  created_at      TIMESTAMPTZ DEFAULT NOW(),
  user_id         UUID NOT NULL,
  CONSTRAINT unique_content_per_user UNIQUE (content_hash, user_id)
);
```
| Field | Type | Purpose |
| --- | --- | --- |
| id | UUID | Stable source identifier referenced by every observation, interpretation, and timeline event derived from it |
| content_hash | TEXT | SHA-256 of the raw bytes; combined with user_id it is the deduplication key |
| storage_url | TEXT | Where the bytes actually live (object storage, local disk, …) |
| storage_status | TEXT | uploaded / pending / failed; ingestion uses this to gate downstream interpretation |
| mime_type | TEXT | Used to choose the right interpreter and to render the source back to humans |
| byte_size | INTEGER | Quota accounting, integrity sanity-check |
| source_type | TEXT | Classifier (file, http, structured, …) used by the read path and Inspector |
| source_agent_id | TEXT | Optional attribution of the writing agent (AAuth tier, clientInfo) |
| source_metadata | JSONB | Free-form provenance (URL, headers, capture tool, etc.) |
| user_id | UUID | Owner; combined with content_hash this enforces per-user dedupe and RLS |
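In application code the row can be mirrored as a plain record type. The following is a hand-written sketch (camelCased field names and the example values are assumptions, not a generated client type):

```typescript
// Hypothetical TypeScript mirror of a sources row; nullable columns are optional.
interface SourceRow {
  id: string;                // UUID
  contentHash: string;       // SHA-256 hex of the raw bytes
  storageUrl: string;
  storageStatus: "uploaded" | "pending" | "failed";
  mimeType: string;
  fileName?: string;
  byteSize: number;
  sourceType: string;        // e.g. "file", "http", "structured"
  sourceAgentId?: string;
  sourceMetadata: Record<string, unknown>;
  createdAt: string;         // TIMESTAMPTZ rendered as an ISO-8601 string
  userId: string;            // UUID
}

// Example values only; the hash is sha256("test") and byteSize matches those 4 bytes.
const example: SourceRow = {
  id: "00000000-0000-0000-0000-000000000001",
  contentHash: "9f86d081884c7d659a2feaa0c55ad015a3bf4f1b2b0b822cd15d6c15b0f00a08",
  storageUrl: "s3://bucket/raw/example",
  storageStatus: "uploaded",
  mimeType: "text/plain",
  byteSize: 4,
  sourceType: "file",
  sourceMetadata: {},
  createdAt: "2024-01-01T00:00:00Z",
  userId: "00000000-0000-0000-0000-0000000000aa",
};
```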

Per-user content addressing#

Two writes of identical bytes by the same user collapse to a single sources row via the unique (content_hash, user_id) constraint. Two different users uploading the same bytes get two distinct sources rows: deduplication is intentionally not cross-user so privacy boundaries remain intact and per-user storage accounting stays accurate.
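The dedupe behavior above can be sketched in a few lines. `dedupeKey` is a hypothetical helper, not Neotoma's API; it simply models the `(content_hash, user_id)` uniqueness key:

```typescript
import { createHash } from "node:crypto";

// Hypothetical helper: derive the per-user deduplication key for raw bytes,
// mirroring the unique (content_hash, user_id) constraint.
function dedupeKey(bytes: Buffer, userId: string): string {
  const contentHash = createHash("sha256").update(bytes).digest("hex");
  return `${userId}:${contentHash}`;
}

const payload = Buffer.from("same bytes");

// Same user, same bytes: keys collide, so the write collapses to one row.
console.log(dedupeKey(payload, "user-a") === dedupeKey(payload, "user-a")); // true

// Different users, same bytes: keys differ, so each user keeps a distinct row.
console.log(dedupeKey(payload, "user-a") === dedupeKey(payload, "user-b")); // false
```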

Lifecycle#

Sources are created by the ingest path, consumed by zero or more interpretations, and (if the user explicitly deletes) cascade-removed along with their interpretations, observations, and timeline events. Reinterpretation never touches the sources row; it creates a new interpretations row pointing at the same source.
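The append-only reinterpretation rule can be modeled in memory. Types and names below are hypothetical illustrations, not Neotoma's schema bindings:

```typescript
type Source = { id: string; contentHash: string };
type Interpretation = { id: string; sourceId: string; version: number };

// Reinterpretation appends a new interpretations row pointing at the same
// source. The source itself is only read, never rewritten.
function reinterpret(
  source: Source,
  existing: readonly Interpretation[]
): Interpretation[] {
  const version = existing.filter((i) => i.sourceId === source.id).length + 1;
  return [
    ...existing,
    { id: `${source.id}-interp-${version}`, sourceId: source.id, version },
  ];
}
```

Each call returns a longer list of interpretations while the `Source` value is untouched, matching the immutability guarantee above.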

Row-level security#

All downstream reads filter by source_id ∈ caller's owned sources. Even where user_id is denormalised onto downstream rows, the source-scoped filter is the security boundary. Only the MCP server writes sources via service_role; clients never insert directly.
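The ownership boundary can be illustrated as a read-path filter. This is a sketch only; in the hosted deployment the enforcement lives in Postgres RLS policies, not application code, and the names here are invented:

```typescript
type DerivedRow = { sourceId: string; payload: unknown };

// Only rows whose source_id is in the caller's owned-source set survive the
// read path, regardless of any denormalised user_id on the row itself.
function scopeToOwnedSources<T extends DerivedRow>(
  rows: T[],
  ownedSourceIds: Set<string>
): T[] {
  return rows.filter((r) => ownedSourceIds.has(r.sourceId));
}
```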

Invariants#

Every source satisfies the following constraints:

MUST

  • Carry a non-null content_hash, byte_size, mime_type, source_type, and user_id
  • Be deduplicated per user; repeat ingest of identical bytes returns the existing row
  • Be referenced by every interpretation, observation, and timeline event derived from it (FK enforced)
  • Be deletable only via explicit user action, which cascades to all derived primitives

MUST NOT

  • Be mutated after upload; bytes and metadata are append-only
  • Be deduped across user boundaries; content addressing is per-user
  • Carry interpreted/extracted data; extraction lives on observations and interpretations
  • Be exposed via APIs that bypass the source-ownership filter
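The non-null MUST constraints lend themselves to a pre-insert guard. A minimal sketch with hypothetical names (the real constraints are enforced by the `NOT NULL` columns in Postgres):

```typescript
// Fields the MUST invariants require on every new source.
type NewSource = {
  contentHash: string | null;
  byteSize: number | null;
  mimeType: string | null;
  sourceType: string | null;
  userId: string | null;
};

// Hypothetical guard: reject a write that would violate a MUST invariant
// before it ever reaches the database.
function assertRequiredFields(s: NewSource): void {
  for (const [field, value] of Object.entries(s)) {
    if (value === null || value === "") {
      throw new Error(`source violates MUST invariant: ${field} is required`);
    }
  }
}
```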

Where to go next#

  • All primitive record types: index of sources, interpretations, observations, relationships, and timeline events
  • Architecture: how the primitives compose into Neotoma's deterministic state
  • Terminology: canonical glossary of terms used across Neotoma docs