High-velocity intake gives firehose sources — news feeds, social streams, webhooks, log tails — a first-class path into Neotoma that keeps the knowledge graph clean. Two mechanisms work together: an overflow sink that discards-by-default at write time, and canonical key sightings that aggregate duplicates at read time without ever merging or deleting stored rows.
The problem it solves
A firehose source produces more events than belong in a graph. Storing every raw event creates noise, pollutes deduplication, and buries signal under volume. The prior approach — filtering outside Neotoma and only storing "interesting" events — puts domain logic in the agent and makes replay impossible.
High-velocity intake inverts this: write everything to the overflow sink (fast, disk-only, no graph touch), then surface aggregated signal through canonical_key sightings at read time. The graph stays clean; the full audit trail is on disk.
Overflow sink
The store tool accepts an optional intake object:
{
  "intake": {
    "mode": "overflow",
    "reason": "raw twitter stream event"
  },
  "entities": [{ ... }]
}
When mode is "overflow", the store call:
- Appends the raw payload as a JSONL line to the configured sink file. No entity or observation is created.
- Returns an OverflowReceipt immediately: { overflowed: true, sink_path, line_offset }.
- Never touches the graph: dedup, schema validation, and observation indexing are all skipped.
The default mode is "graph" (today's behavior). All existing store calls that omit intake are unchanged.
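The append-and-receipt behavior described above can be sketched as follows. This is a minimal illustration, not Neotoma's implementation: the function name overflow_append is hypothetical, and the sketch assumes line_offset is the byte offset at which the appended line starts (the source does not say whether it is a byte offset or a line index). It also ignores concurrent writers.

```python
import json
import os


def overflow_append(sink_path: str, payload: dict) -> dict:
    """Append one payload as a JSONL line and return a receipt-shaped dict.

    Hypothetical helper. Assumption: line_offset is the byte offset of the
    start of the appended line; the real field may mean something else.
    """
    line = json.dumps(payload, separators=(",", ":")).encode("utf-8") + b"\n"
    with open(sink_path, "ab") as sink:
        # Seek explicitly so the offset is the current end of file.
        sink.seek(0, os.SEEK_END)
        line_offset = sink.tell()
        sink.write(line)
    # No entity, observation, or index is touched: the write is disk-only.
    return {"overflowed": True, "sink_path": sink_path, "line_offset": line_offset}
```

Because the only work is a single append, the cost per event does not depend on graph size, which is the property that makes this path suitable for firehose sources.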
Configuring the sink
Set NEOTOMA_OVERFLOW_SINK to an absolute path:
- Directory path: Neotoma writes daily files, overflow-YYYY-MM-DD.jsonl, inside the directory.
- .jsonl path: Neotoma writes to that exact file.
If NEOTOMA_OVERFLOW_SINK is unset and a caller passes intake.mode: "overflow", the call returns a structured hint rather than writing or throwing:
{
  "error": "OVERFLOW_SINK_NOT_CONFIGURED",
  "hint": "Set NEOTOMA_OVERFLOW_SINK env var to an absolute directory path..."
}
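The three configuration cases (unset, directory, exact .jsonl file) could be resolved along these lines. This is a sketch under assumptions: resolve_sink is a hypothetical name, the two path forms are distinguished here by a .jsonl suffix check, and the date is passed in because the source does not say which clock or time zone decides the daily file name.

```python
import os
from datetime import date
from typing import Optional, Union


def resolve_sink(env_value: Optional[str], today: date) -> Union[str, dict]:
    """Map a NEOTOMA_OVERFLOW_SINK value to a sink file path or a hint.

    Hypothetical helper; the suffix check is an assumption about how the
    directory form and the exact-file form are told apart.
    """
    if not env_value:
        # Unset: return a structured hint instead of writing or raising.
        return {
            "error": "OVERFLOW_SINK_NOT_CONFIGURED",
            "hint": "Set NEOTOMA_OVERFLOW_SINK env var to an absolute directory path...",
        }
    if env_value.endswith(".jsonl"):
        # Exact-file form: write to that file as given.
        return env_value
    # Directory form: one file per day, named overflow-YYYY-MM-DD.jsonl.
    return os.path.join(env_value, f"overflow-{today.isoformat()}.jsonl")
```

A caller would typically pass os.environ.get("NEOTOMA_OVERFLOW_SINK") as the first argument and check whether the result is a dict before writing.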
When to use overflow mode
- You receive events faster than they can be meaningfully evaluated
- You want a complete audit trail on disk but only a filtered subset in the graph
- You need replay capability: process the JSONL offline, then store the interesting events with mode: "graph"
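An offline replay pass might look like the sketch below. It assumes each sink line holds the original store request, including its entities array; the exact line layout is not specified here, so treat that as an assumption. The replay function and the is_interesting predicate are hypothetical names.

```python
import json
from typing import Callable, Iterable, Iterator


def replay(lines: Iterable[str], is_interesting: Callable[[dict], bool]) -> Iterator[dict]:
    """Yield graph-mode store payloads for the entities worth keeping.

    Assumption: every non-empty line is a JSON object with an "entities" list.
    """
    for line in lines:
        line = line.strip()
        if not line:
            continue  # tolerate blank lines in the sink file
        record = json.loads(line)
        kept = [e for e in record.get("entities", []) if is_interesting(e)]
        if kept:
            # Re-submit only the filtered subset, this time through the graph path.
            yield {"intake": {"mode": "graph"}, "entities": kept}
```

Each yielded payload would then be passed to the store tool. Because the sink file is never modified, the same file can be replayed again later with a different predicate.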
Canonical key sightings
For sources that publish the same logical event multiple times (multiple aggregators reporting the same news item, retries, mirror feeds), canonical_key and sighting_source_id let you store each sighting as its own row and aggregate them at read time — without any write-path merge.
Storing sightings
Include canonical_key and sighting_source_id in each entity's snapshot (or as top-level observation fields). Both are nullable additive columns: existing rows without them are unaffected.
- canonical_key: a stable identifier for the logical event (e.g. the article URL, the tweet ID, the order number)
- sighting_source_id: which feed or source produced this row (e.g. "reuters", "ap", "bloomberg")
Each (sighting_source_id, canonical_key) pair is an idempotent row. Re-storing the same sighting overwrites (updates) the existing row; it does not create a duplicate.
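The idempotency rule can be modeled with an in-memory dictionary keyed by the pair. This is only an illustration of the semantics, not the storage code: store_sighting and the title field are hypothetical.

```python
from typing import Dict, Tuple

# (sighting_source_id, canonical_key) identifies exactly one stored row.
SightingKey = Tuple[str, str]


def store_sighting(rows: Dict[SightingKey, dict], entity: dict) -> None:
    """Upsert one sighting: same source and key overwrites, never duplicates."""
    key = (entity["sighting_source_id"], entity["canonical_key"])
    rows[key] = entity


rows: Dict[SightingKey, dict] = {}
url = "https://example.com/article/123"
store_sighting(rows, {"sighting_source_id": "reuters", "canonical_key": url, "title": "draft"})
store_sighting(rows, {"sighting_source_id": "reuters", "canonical_key": url, "title": "final"})
store_sighting(rows, {"sighting_source_id": "ap", "canonical_key": url, "title": "wire copy"})
```

After these three calls there are two rows: the reuters row holds the later title, and the ap sighting of the same article is a separate row.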
Reading with collapse_by: "canonical_key"
Pass collapse_by: "canonical_key" to retrieve_entities to aggregate sightings at read time:
{
  "entity_type": "news_item",
  "collapse_by": "canonical_key"
}
The response groups rows that share the same canonical_key into one synthesized result per key:
{
  "entities": [
    {
      "canonical_key": "https://example.com/article/123",
      "sightings": [
        { "sighting_source_id": "reuters", "entity_id": "ent_aaa" },
        { "sighting_source_id": "ap", "entity_id": "ent_bbb" }
      ],
      "significance": 0.9,
      "snapshot": { ... }
    }
  ],
  "collapsed_by": "canonical_key"
}
Underlying rows are never merged or deleted. Collapse is a view operation at read time, so a dedup bug cannot corrupt stored data. Entities without a canonical_key are returned as-is alongside collapsed groups.
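As a rough model of the read-time view, the grouping can be sketched as a pure function over already-fetched rows. The function name is hypothetical, and the sketch deliberately omits significance and snapshot, because how those are chosen for a group is not described here.

```python
from typing import Dict, List


def collapse_by_canonical_key(rows: List[dict]) -> List[dict]:
    """Group rows sharing a canonical_key; pass keyless rows through unchanged.

    The input list and its rows are never modified, mirroring the point that
    collapse is a view and cannot corrupt stored data.
    """
    groups: Dict[str, dict] = {}
    results: List[dict] = []
    for row in rows:
        key = row.get("canonical_key")
        if key is None:
            results.append(row)  # no canonical_key: returned as-is
            continue
        group = groups.get(key)
        if group is None:
            group = {"canonical_key": key, "sightings": []}
            groups[key] = group
            results.append(group)  # keep first-seen order of keys
        group["sightings"].append(
            {"sighting_source_id": row["sighting_source_id"], "entity_id": row["entity_id"]}
        )
    return results
```

Since the function only builds new dictionaries, rerunning it with corrected grouping logic is always possible; the stored sightings remain the source of truth.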
Upgrade notes
Both mechanisms are fully backward-compatible:
- intake defaults to { mode: "graph" } when omitted, so there is no behavior change for existing callers.
- canonical_key and sighting_source_id are nullable additive columns; existing observation rows get NULL and are unchanged.
- collapse_by defaults to no collapsing when omitted.
The canonical_key index is created automatically on startup via the existing addColumnIfMissing migration pattern.
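For readers unfamiliar with the pattern, an add-column-if-missing migration can be sketched as below. This is an analogue written against SQLite through Python's sqlite3 module, not Neotoma's addColumnIfMissing itself: the storage engine, the observations table name, the TEXT column type, and the index name are all assumptions made for illustration.

```python
import sqlite3


def add_column_if_missing(conn: sqlite3.Connection, table: str, column: str, decl: str) -> bool:
    """Add a nullable column only when it is absent; return True if it was added."""
    # PRAGMA table_info yields one row per column; index 1 is the column name.
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    if column in existing:
        return False
    conn.execute(f"ALTER TABLE {table} ADD COLUMN {column} {decl}")
    return True


def migrate(conn: sqlite3.Connection) -> None:
    """Idempotent startup migration: safe to run on every boot."""
    add_column_if_missing(conn, "observations", "canonical_key", "TEXT")
    add_column_if_missing(conn, "observations", "sighting_source_id", "TEXT")
    conn.execute(
        "CREATE INDEX IF NOT EXISTS idx_observations_canonical_key "
        "ON observations (canonical_key)"
    )
```

Running migrate twice is a no-op the second time, and rows that existed before the migration read back NULL for both new columns.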
Shipped in v0.17.0 (PRs #1604, #1697). See the changelog for release notes. For embedding the Inspector graph in your product, see embeddable graph.