# Sources
A source is the raw, content-addressed artifact that every other primitive ultimately traces back to: a file you uploaded, a webhook payload, a structured agent write. Sources are deduplicated per user by SHA-256 content hash so the same bytes are never stored twice.
Source → Interpretation → Observation → Snapshot. Sources are the leftmost, immutable foundation of the three-layer truth model.
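At its core the content address is nothing more than the SHA-256 of the raw bytes. As an illustration only (the real ingest path may well hash in application code rather than in the database), the same digest can be computed in Postgres via the pgcrypto extension:

```sql
-- Illustrative sketch: the content address is the hex-encoded SHA-256 of the raw bytes.
CREATE EXTENSION IF NOT EXISTS pgcrypto;

SELECT encode(digest('raw source bytes'::bytea, 'sha256'), 'hex') AS content_hash;
```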
## Schema
sources table (Postgres / hosted)
| Field | Type | Purpose |
|---|---|---|
| id | UUID | Stable source identifier referenced by every observation, interpretation, and timeline event derived from it |
| content_hash | TEXT | SHA-256 of the raw bytes; combined with user_id it is the deduplication key |
| storage_url | TEXT | Where the bytes actually live (object storage, local disk, …) |
| storage_status | TEXT | uploaded / pending / failed; ingestion uses this to gate downstream interpretation |
| mime_type | TEXT | Used to choose the right interpreter and to render the source back to humans |
| byte_size | INTEGER | Quota accounting and integrity sanity check |
| source_type | TEXT | Classifier (file, http, structured, …) used by the read path and Inspector |
| source_agent_id | TEXT | Optional attribution of the writing agent (AAuth tier, clientInfo) |
| source_metadata | JSONB | Free-form provenance (URL, headers, capture tool, etc.) |
| user_id | UUID | Owner; combined with content_hash this enforces per-user dedupe and RLS |
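For orientation, the schema above corresponds to DDL along these lines. This is a sketch, not the actual migration: the defaults shown, and any indexes or columns omitted from the table above, are assumptions.

```sql
-- Sketch of the sources table described above (not Neotoma's real migration).
CREATE TABLE sources (
  id              UUID    PRIMARY KEY DEFAULT gen_random_uuid(),
  content_hash    TEXT    NOT NULL,                    -- SHA-256 of the raw bytes
  storage_url     TEXT,                                -- where the bytes actually live
  storage_status  TEXT    NOT NULL DEFAULT 'pending',  -- uploaded / pending / failed
  mime_type       TEXT    NOT NULL,
  byte_size       INTEGER NOT NULL,
  source_type     TEXT    NOT NULL,                    -- file / http / structured / ...
  source_agent_id TEXT,                                -- optional writing-agent attribution
  source_metadata JSONB   DEFAULT '{}'::jsonb,         -- free-form provenance
  user_id         UUID    NOT NULL,                    -- owner
  UNIQUE (content_hash, user_id)                       -- per-user content addressing
);
```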
## Per-user content addressing
Two writes of identical bytes by the same user collapse to a single sources row via the unique (content_hash, user_id) constraint. Two different users uploading the same bytes get two distinct sources rows: deduplication is intentionally not cross-user so privacy boundaries remain intact and per-user storage accounting stays accurate.
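In SQL terms the ingest behaviour can be sketched as an upsert against that constraint. This is a hedged example with made-up values; the production path may instead do a check-then-insert in application code:

```sql
-- Repeat ingest of identical bytes by the same user returns the existing row id
-- instead of creating a second row. The value-preserving DO UPDATE exists only
-- so that RETURNING can surface the pre-existing id on conflict.
INSERT INTO sources (content_hash, storage_url, storage_status,
                     mime_type, byte_size, source_type, user_id)
VALUES ('e3b0c44298fc1c149afbf4c8996fb92427ae41e4649b934ca495991b7852b855',
        's3://neotoma-sources/e3b0c442',               -- hypothetical storage URL
        'uploaded', 'application/pdf', 482133, 'file',
        '00000000-0000-0000-0000-000000000001')        -- hypothetical user_id
ON CONFLICT (content_hash, user_id)
  DO UPDATE SET content_hash = EXCLUDED.content_hash   -- no-op, value is unchanged
RETURNING id;
```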
## Lifecycle
Sources are created by the ingest path, consumed by zero or more interpretations, and, if the user explicitly deletes them, cascade-removed along with their interpretations, observations, and timeline events. Reinterpretation never touches the sources row; it creates a new interpretations row pointing at the same source.
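The cascade is the ordinary relational pattern of derived tables carrying a foreign key back to sources. A sketch under assumptions (only source_id is taken from this page; the other interpretations columns are placeholders):

```sql
-- Sketch: derived rows reference their source and disappear with it.
CREATE TABLE interpretations (
  id         UUID PRIMARY KEY DEFAULT gen_random_uuid(),
  source_id  UUID NOT NULL REFERENCES sources (id) ON DELETE CASCADE,
  created_at TIMESTAMPTZ NOT NULL DEFAULT now()
  -- ... extraction payload, interpreter version, etc. (placeholders)
);

-- Reinterpretation inserts a new interpretations row with the same source_id;
-- the sources row itself is never touched.
-- An explicit user delete removes the source and, via ON DELETE CASCADE,
-- every interpretation, observation, and timeline event derived from it.
DELETE FROM sources WHERE id = '00000000-0000-0000-0000-000000000042';
```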
## Row-level security
All downstream reads filter by source_id ∈ caller's owned sources. Even where user_id is denormalised onto downstream rows, the source-scoped filter is the security boundary. Only the MCP server writes sources via service_role; clients never insert directly.
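A sketch of what that boundary can look like as Postgres row-level security. The auth.uid() helper follows the Supabase convention and, like the policy names, is an assumption here; the actual policies may differ:

```sql
-- Sketch: owner-only reads on sources, source-scoped reads downstream.
ALTER TABLE sources         ENABLE ROW LEVEL SECURITY;
ALTER TABLE interpretations ENABLE ROW LEVEL SECURITY;

CREATE POLICY sources_owner_read ON sources
  FOR SELECT USING (user_id = auth.uid());

-- Downstream rows are gated through their source, not through any
-- denormalised user_id they may carry.
CREATE POLICY interpretations_via_source ON interpretations
  FOR SELECT USING (
    EXISTS (
      SELECT 1 FROM sources s
      WHERE s.id = interpretations.source_id
        AND s.user_id = auth.uid()
    )
  );

-- No INSERT policy is granted to client roles; writes arrive only through the
-- MCP server's service_role, which bypasses RLS entirely.
```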
## Invariants
Every source satisfies the following constraints:
MUST
- Carry a non-null content_hash, byte_size, mime_type, source_type, and user_id
- Be deduplicated per user: repeat ingest of identical bytes returns the existing row
- Be referenced by every interpretation, observation, and timeline event derived from it (FK enforced)
- Be deletable only via explicit user action, which cascades to all derived primitives
MUST NOT
- Be mutated after upload: bytes and metadata are append-only (see the sketch after this list)
- Be deduped across user boundaries: content addressing is per-user
- Carry interpreted or extracted data: extraction lives on observations and interpretations
- Be exposed via APIs that bypass the source-ownership filter
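One way to back the append-only rule at the database layer, as referenced in the list above. This is a sketch under assumptions; in particular, allowing storage_status to transition while freezing every other column is an inference from the schema notes, and Neotoma may enforce immutability in the ingest path instead:

```sql
-- Sketch: reject mutation of anything except the operational storage_status.
CREATE OR REPLACE FUNCTION reject_source_mutation() RETURNS trigger AS $$
BEGIN
  IF NEW.content_hash       IS DISTINCT FROM OLD.content_hash
     OR NEW.storage_url     IS DISTINCT FROM OLD.storage_url
     OR NEW.mime_type       IS DISTINCT FROM OLD.mime_type
     OR NEW.byte_size       IS DISTINCT FROM OLD.byte_size
     OR NEW.source_type     IS DISTINCT FROM OLD.source_type
     OR NEW.source_agent_id IS DISTINCT FROM OLD.source_agent_id
     OR NEW.source_metadata IS DISTINCT FROM OLD.source_metadata
     OR NEW.user_id         IS DISTINCT FROM OLD.user_id THEN
    RAISE EXCEPTION 'sources rows are append-only after upload';
  END IF;
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER sources_are_immutable
  BEFORE UPDATE ON sources
  FOR EACH ROW EXECUTE FUNCTION reject_source_mutation();
```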
## Related
- Sources subsystem doc: full source-and-interpretation lifecycle, MCP tools, quota model
- Interpretations: versioned extraction attempts that consume a source
- Observations: granular facts produced from a source via an interpretation
- Timeline events: source-anchored temporal records derived from extracted dates
- Determinism doctrine: where sources sit on the deterministic-vs-non-deterministic boundary
## Where to go next
- All primitive record types: index of sources, interpretations, observations, relationships, and timeline events
- Architecture: how the primitives compose into Neotoma's deterministic state
- Terminology: canonical glossary of terms used across Neotoma docs