# Sources
A source is the raw, content-addressed artifact that every other primitive ultimately traces back to: a file you uploaded, a webhook payload, a structured agent write. Sources are deduplicated per user by SHA-256 content hash so the same bytes are never stored twice.
Source → Interpretation → Observation → Snapshot. Sources are the leftmost, immutable foundation of the three-layer truth model.
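At its core the content address is nothing more than the SHA-256 of the raw bytes. As an illustration only (the real ingest path may well hash in application code rather than in the database), the same digest can be computed in Postgres via the pgcrypto extension:

```sql
-- Illustrative sketch: the content address is the hex-encoded SHA-256 of the raw bytes.
CREATE EXTENSION IF NOT EXISTS pgcrypto;

SELECT encode(digest('raw source bytes'::bytea, 'sha256'), 'hex') AS content_hash;
```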
## Schema
sources table (Postgres / hosted)
| Field | Type | Purpose |
|---|---|---|
| id | UUID | Stable source identifier referenced by every observation, interpretation, and timeline event derived from it |
| content_hash | TEXT | SHA-256 of the raw bytes; combined with user_id it is the deduplication key |
| storage_url | TEXT | Where the bytes actually live (object storage, local disk, …) |
| storage_status | TEXT | uploaded / pending / failed; ingestion uses this to gate downstream interpretation |
| mime_type | TEXT | Used to choose the right interpreter and to render the source back to humans |
| byte_size | INTEGER | Quota accounting and integrity sanity check |
| source_type | TEXT | Classifier (file, http, structured, …) used by the read path and Inspector |
| source_agent_id | TEXT | Optional attribution of the writing agent (AAuth tier, clientInfo) |
| source_metadata | JSONB | Free-form provenance (URL, headers, capture tool, etc.) |
| user_id | UUID | Owner; combined with content_hash this enforces per-user dedupe and RLS |
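For orientation, the schema above corresponds to DDL along these lines. This is a sketch, not the actual migration: the defaults shown, and any indexes or columns omitted from the table above, are assumptions.

```sql
-- Sketch of the sources table described above (not Neotoma's real migration).
CREATE TABLE sources (
  id              UUID    PRIMARY KEY DEFAULT gen_random_uuid(),
  content_hash    TEXT    NOT NULL,                    -- SHA-256 of the raw bytes
  storage_url     TEXT,                                -- where the bytes actually live
  storage_status  TEXT    NOT NULL DEFAULT 'pending',  -- uploaded / pending / failed
  mime_type       TEXT    NOT NULL,
  byte_size       INTEGER NOT NULL,
  source_type     TEXT    NOT NULL,                    -- file / http / structured / ...
  source_agent_id TEXT,                                -- optional writing-agent attribution
  source_metadata JSONB   DEFAULT '{}'::jsonb,         -- free-form provenance
  user_id         UUID    NOT NULL,                    -- owner
  UNIQUE (content_hash, user_id)                       -- per-user content addressing
);
```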
## Per-user content addressing
Two writes of identical bytes by the same user collapse to a single sources row via the unique (content_hash, user_id) constraint. Two different users uploading the same bytes get two distinct sources rows: deduplication is intentionally not cross-user so privacy boundaries remain intact and per-user storage accounting stays accurate.
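In SQL terms the ingest behaviour can be sketched as an upsert against that constraint. This is a hedged example with made-up values; the production path may instead do a check-then-insert in application code:

```sql
-- Repeat ingest of identical bytes by the same user returns the existing row id
-- instead of creating a second row. The value-preserving DO UPDATE exists only
-- so that RETURNING can surface the pre-existing id on conflict.
INSERT INTO sources (content_hash, storage_url, storage_status,
                     mime_type, byte_size, source_type, user_id)
VALUES ('e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855',
        's3://neotoma-sources/e3b0c442',               -- hypothetical storage URL
        'uploaded', 'application/pdf', 482133, 'file',
        '00000000-0000-0000-0000-000000000001')        -- hypothetical user_id
ON CONFLICT (content_hash, user_id)
  DO UPDATE SET content_hash = EXCLUDED.content_hash   -- no-op, value is unchanged
RETURNING id;
```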
## Lifecycle
Sources are created by the ingest path, consumed by zero or more interpretations, and, if the user explicitly deletes them, cascade-removed along with their interpretations, observations, and timeline events. Reinterpretation never touches the sources row; it creates a new interpretations row pointing at the same source.
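The cascade is the ordinary relational pattern of derived tables carrying a foreign key back to sources. A sketch under assumptions (only source_id is taken from this page; the other interpretations columns are placeholders):

```sql
-- Sketch: derived rows reference their source and disappear with it.
CREATE TABLE interpretations (
  id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  source_id  UUID NOT NULL REFERENCES sources (id) ON DELETE CASCADE,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
  -- ... extraction payload, interpreter version, etc. (placeholders)
);

-- Reinterpretation inserts a new interpretations row with the same source_id;
-- the sources row itself is never touched.
-- An explicit user delete removes the source and, via ON DELETE CASCADE,
-- every interpretation, observation, and timeline event derived from it.
DELETE FROM sources WHERE id = '00000000-0000-0000-0000-000000000042';
```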
## Row-level security
All downstream reads filter by source_id ∈ caller's owned sources. Even where user_id is denormalised onto downstream rows, the source-scoped filter is the security boundary. Only the MCP server writes sources via service_role; clients never insert directly.
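A sketch of what that boundary can look like as Postgres row-level security. The auth.uid() helper follows the Supabase convention and, like the policy names, is an assumption here; the actual policies may differ:

```sql
-- Sketch: owner-only reads on sources, source-scoped reads downstream.
ALTER TABLE sources         ENABLE ROW LEVEL SECURITY;
ALTER TABLE interpretations ENABLE ROW LEVEL SECURITY;

CREATE POLICY sources_owner_read ON sources
  FOR SELECT USING (user_id = auth.uid());

-- Downstream rows are gated through their source, not through any
-- denormalised user_id they may carry.
CREATE POLICY interpretations_via_source ON interpretations
  FOR SELECT USING (
    EXISTS (
      SELECT 1 FROM sources s
      WHERE s.id = interpretations.source_id
        AND s.user_id = auth.uid()
    )
  );

-- No INSERT policy is granted to client roles; writes arrive only through the
-- MCP server's service_role, which bypasses RLS entirely.
```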
## Invariants
Every source satisfies the following constraints:
MUST
- Carry a non-null content_hash, byte_size, mime_type, source_type, and user_id
- Be deduplicated per user: repeat ingest of identical bytes returns the existing row
- Be referenced by every interpretation, observation, and timeline event derived from it (FK enforced)
- Be deletable only via explicit user action, which cascades to all derived primitives
MUST NOT
- Be mutated after upload: bytes and metadata are append-only (see the sketch after this list)
- Be deduped across user boundaries: content addressing is per-user
- Carry interpreted or extracted data: extraction lives on observations and interpretations
- Be exposed via APIs that bypass the source-ownership filter
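One way to back the append-only rule at the database layer, as referenced in the list above. This is a sketch under assumptions; in particular, allowing storage_status to transition while freezing every other column is an inference from the schema notes, and Neotoma may enforce immutability in the ingest path instead:

```sql
-- Sketch: reject mutation of anything except the operational storage_status.
CREATE OR REPLACE FUNCTION reject_source_mutation() RETURNS trigger AS $$
BEGIN
  IF NEW.content_hash       IS DISTINCT FROM OLD.content_hash
     OR NEW.storage_url     IS DISTINCT FROM OLD.storage_url
     OR NEW.mime_type       IS DISTINCT FROM OLD.mime_type
     OR NEW.byte_size       IS DISTINCT FROM OLD.byte_size
     OR NEW.source_type     IS DISTINCT FROM OLD.source_type
     OR NEW.source_agent_id IS DISTINCT FROM OLD.source_agent_id
     OR NEW.source_metadata IS DISTINCT FROM OLD.source_metadata
     OR NEW.user_id         IS DISTINCT FROM OLD.user_id THEN
    RAISE EXCEPTION 'sources rows are append-only after upload';
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sources_are_immutable
  BEFORE UPDATE ON sources
  FOR EACH ROW EXECUTE FUNCTION reject_source_mutation();
```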
## Related
- Sources subsystem doc: full source-and-interpretation lifecycle, MCP tools, quota model
- Interpretations: versioned extraction attempts that consume a source
- Observations: granular facts produced from a source via an interpretation
- Timeline events: source-anchored temporal records derived from extracted dates
- Determinism doctrine: where sources sit on the deterministic-vs-non-deterministic boundary
## Where to go next
- All primitive record types: index of sources, interpretations, observations, relationships, and timeline events
- Architecture: how the primitives compose into Neotoma's deterministic state
- Terminology: canonical glossary of terms used across Neotoma docs