Most RAG pipeline content stops at batch. Embed your corpus, build your index, query it. Clean, simple, done.
That's not production. Production has data arriving continuously. Source systems changing. Vectors going stale while users are querying them. The moment you need freshness in your index (real freshness, not a nightly rebuild), you're in near-real-time territory, and the architecture gets meaningfully more complex.
This post is about what that complexity actually looks like when you wire Auto Loader, Spark Structured Streaming, and LanceDB together. Not the happy path. The seams.
The Stack
The pipeline has a clear left-to-right flow:
Replication tool → ADLS landing zone → Auto Loader → Structured Streaming → foreachBatch → LanceDB
Each arrow is a handoff. Each handoff is a contract between two tools with different assumptions about data, delivery, and failure. The friction lives at those contracts — not inside the tools themselves.
The Upstream Replication Boundary
Before the first line of Spark code runs, your latency ceiling is already set.
Whatever sits upstream (Fivetran, Debezium, AWS DMS, a homegrown CDC job), that tool's replication cadence defines what near real time actually means for your pipeline. You can tune trigger intervals and batch sizes all day. You cannot beat the frequency of the feed.
Set that expectation early with your stakeholders. Near real time (NRT) in this context means minutes after replication lands, not milliseconds after the source transaction commits. That's an honest and defensible definition. It's also the only one your architecture can actually deliver.
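To put a number on that expectation, here is a back-of-the-envelope sketch. It assumes worst-case staleness is simply the sum of the replication interval, the stream trigger interval, and the time one micro-batch takes to embed and write. The function name and the example numbers are illustrative, not measurements.

```python
def worst_case_staleness_s(replication_interval_s: float,
                           trigger_interval_s: float,
                           batch_processing_s: float) -> float:
    """Upper bound on how old a change can be before it is queryable.

    Assumption: a change can just miss a replication cycle, then just miss
    a trigger, and then has to wait for its micro-batch to finish.
    """
    return replication_interval_s + trigger_interval_s + batch_processing_s


# Illustrative numbers only: 5-minute replication, 30 s trigger, 20 s batch.
budget = worst_case_staleness_s(300, 30, 20)
print(f"worst case: {budget / 60:.1f} minutes")
```

In a budget like this, the replication term usually dominates, which is why it belongs in the stakeholder conversation before any Spark tuning does.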
The change event impedance mismatch
Replication tools think in change events — inserts, updates, deletes. Your vector pipeline doesn't consume data the same way.
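LanceDB's storage is append-oriented: every write creates new data files and a new table version rather than editing rows in place, and a plain table.add() never replaces an existing row. A source-side update doesn't translate cleanly downstream. It becomes a delete-and-rewrite problem (or an explicit merge_insert keyed on the source ID), and if your pipeline doesn't account for that explicitly, you end up with duplicate vectors serving stale content. Retrieval quality degrades quietly. No errors, just wrong answers.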
This is the first place where relational thinking will burn you. Reframe early: you're managing embeddings and indexes, not rows and primary keys.
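One way to make the delete-and-rewrite step explicit is to collapse each micro-batch of change events into two things: the source IDs whose old vectors must go, and the rows that need fresh embeddings. The sketch below is a minimal version of that idea. The event shape (id, op, version, text) and the function name are assumptions for illustration, not the output format of any particular replication tool.

```python
from typing import Dict, Iterable, List, Set, Tuple


def plan_sink_ops(events: Iterable[dict]) -> Tuple[Set, List[dict]]:
    """Collapse change events into (ids_to_delete, rows_to_embed).

    Each event is assumed to look like
    {"id": ..., "op": "insert" | "update" | "delete", "version": int, "text": str}.
    """
    latest: Dict = {}
    for ev in events:
        current = latest.get(ev["id"])
        # Keep only the highest version seen for each source key.
        if current is None or ev["version"] > current["version"]:
            latest[ev["id"]] = ev

    ids_to_delete: Set = set()
    rows_to_embed: List[dict] = []
    for ev in latest.values():
        if ev["op"] in ("update", "delete"):
            # Old vectors for this key must be removed before any rewrite.
            ids_to_delete.add(ev["id"])
        if ev["op"] in ("insert", "update"):
            rows_to_embed.append({"id": ev["id"], "text": ev["text"]})
    return ids_to_delete, rows_to_embed


events = [
    {"id": 1, "op": "insert", "version": 1, "text": "draft"},
    {"id": 1, "op": "update", "version": 2, "text": "final"},
    {"id": 2, "op": "delete", "version": 3, "text": ""},
]
to_delete, to_embed = plan_sink_ops(events)
```

In the sink, the delete set would be applied first (for example with a filter on the ID column), and only then would the re-embedded rows be appended. Deleting an ID that was never written is harmless, which keeps the plan safe to replay.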
Friction Point 1 — Auto Loader to Structured Streaming
Auto Loader is elegant. It watches a cloud storage path, detects new files, and feeds them into a Structured Streaming job incrementally. The checkpoint tracks exactly where you are. It handles schema inference. On paper it's the cleanest ingestion primitive in the Databricks ecosystem.
In practice, three things will find you.
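Schema evolution. Your upstream replication tool lands files with a new column. Auto Loader's inferred schema doesn't include it. Your stream fails. Set cloudFiles.schemaEvolutionMode explicitly and make sure the job restarts automatically, because even the addNewColumns mode stops the stream once before it picks up the new column. Don't discover this in production.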
File notification vs directory listing. Auto Loader can detect files two ways: by listing the directory or by subscribing to storage event notifications. Listing is simpler to configure. It's also slower and more expensive at scale. If your landing zone is high-volume, file notification mode (cloudFiles.useNotifications, backed by Event Grid on Azure) isn't optional.
Checkpoint management. The checkpoint is infrastructure, not a side effect. Treat it that way — back it up, monitor it, have a recovery plan. A corrupt checkpoint on a long-running stream means replaying from scratch or accepting a gap. Neither is good.
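The three settings above can be pinned down in one place. The sketch below builds the Auto Loader options as a plain dict so they can be reviewed and tested outside a cluster. The option keys are Auto Loader's own; the helper name, the parquet format, and every path are placeholders chosen for illustration.

```python
def autoloader_options(schema_location: str, use_notifications: bool) -> dict:
    """Build Auto Loader reader options as a plain dict.

    The file format is an assumption; use whatever the replication tool lands.
    """
    return {
        "cloudFiles.format": "parquet",
        # Where Auto Loader persists the inferred schema between runs.
        "cloudFiles.schemaLocation": schema_location,
        # Add new columns to the schema. The stream stops once and picks
        # them up on restart, so the job needs automatic retries.
        "cloudFiles.schemaEvolutionMode": "addNewColumns",
        # File notification mode instead of directory listing.
        "cloudFiles.useNotifications": str(use_notifications).lower(),
    }


opts = autoloader_options("/mnt/checkpoints/docs/_schema", use_notifications=True)

# On a Databricks cluster these would be applied roughly like this
# (paths are placeholders, write_to_lance is the foreachBatch function):
# stream = (spark.readStream.format("cloudFiles")
#           .options(**opts)
#           .load("/mnt/landing/docs"))
# query = (stream.writeStream
#          .option("checkpointLocation", "/mnt/checkpoints/docs")
#          .foreachBatch(write_to_lance)
#          .start())
```

Notification mode on Azure typically needs additional credential options that are left out here. The checkpoint location is set on the writer, not the reader, which is one more reason to treat it as its own piece of infrastructure.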
Friction Point 2 — The foreachBatch Seam
This is the hardest one.
foreachBatch is Spark's escape hatch for writing to arbitrary sinks. It hands you a DataFrame for each micro-batch and lets you do whatever you want with it. What it cannot do is guarantee what happens inside your function.
Spark guarantees delivery to the function. Everything after that is your responsibility.
def write_to_lance(batch_df, batch_id):
    records = batch_df.toPandas()
    records["vector"] = embed(records)  # fails here?
    table.add(records)                  # or here?
    # Spark doesn't know either happened
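If the function fails partway and the exception propagates, Spark fails the batch and reruns it with the same batch_id on restart, and anything already written to LanceDB is now a duplicate. If the function catches the error and returns normally, Spark treats the batch as done, the checkpoint advances, and the records are silently dropped. The function is a black box to the execution engine.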
The fix is idempotency. Track processed batch IDs in a small Delta table and check before writing:
def write_to_lance(batch_df, batch_id):
    if is_already_processed(batch_id):
        return
    try:
        records = batch_df.toPandas()
        records["vector"] = embed(records)
        table.add(records)
        mark_processed(batch_id)
    except Exception as e:
        log.error(f"Batch {batch_id} failed: {e}")
        raise
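The two helpers in that function are left undefined, so here is one possible shape for them. This is a runnable sketch that uses an in-memory SQLite table as a stand-in for the Delta ledger and a Python list as a stand-in for the LanceDB table. The table name, the single-column layout, and write_batch are all assumptions made to keep the example self-contained.

```python
import sqlite3

# Stand-in for the Delta ledger table: an in-memory SQLite table.
ledger = sqlite3.connect(":memory:")
ledger.execute("CREATE TABLE processed_batches (batch_id INTEGER PRIMARY KEY)")


def is_already_processed(batch_id: int) -> bool:
    row = ledger.execute(
        "SELECT 1 FROM processed_batches WHERE batch_id = ?", (batch_id,)
    ).fetchone()
    return row is not None


def mark_processed(batch_id: int) -> None:
    # INSERT OR IGNORE keeps the call itself safe to repeat.
    ledger.execute(
        "INSERT OR IGNORE INTO processed_batches (batch_id) VALUES (?)", (batch_id,)
    )
    ledger.commit()


sink = []  # stand-in for the LanceDB table


def write_batch(rows: list, batch_id: int) -> None:
    if is_already_processed(batch_id):
        return  # a retry of a batch that already landed
    sink.extend(rows)
    mark_processed(batch_id)


write_batch(["a", "b"], batch_id=7)
write_batch(["a", "b"], batch_id=7)  # simulated Spark retry of the same batch
print(len(sink))
```

One gap remains: a crash after the sink write but before mark_processed still replays the batch. Closing it means making the write itself repeatable, for instance by tagging rows with the batch ID and deleting that batch's rows before re-adding them.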
Embedding throughput is the hidden bottleneck. Embedding is CPU- or GPU-bound. If your trigger interval is 30 seconds and embedding a batch takes 45, the backlog grows faster than it clears. Profile your embedding time against realistic batch sizes before you commit to a trigger interval. Cap batch size with Auto Loader's cloudFiles.maxFilesPerTrigger (or cloudFiles.maxBytesPerTrigger) if you need to control throughput.
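The sizing check is simple arithmetic once you have a measured embedding rate. A rough sketch, assuming throughput is roughly constant across batch sizes (which it often is not at the extremes); the function names, the headroom factor, and the numbers are illustrative.

```python
import math


def max_rows_per_trigger(rows_per_second: float,
                         trigger_interval_s: float,
                         headroom: float = 0.7) -> int:
    """Largest batch that embeds within the trigger interval, with slack."""
    return math.floor(rows_per_second * trigger_interval_s * headroom)


def keeps_up(batch_rows: int, rows_per_second: float,
             trigger_interval_s: float) -> bool:
    """True if embedding one batch finishes before the next trigger fires."""
    return batch_rows / rows_per_second <= trigger_interval_s


# Illustrative: 200 rows/s measured embedding throughput, 30 s trigger.
print(max_rows_per_trigger(200, 30))
# 9000 rows at 200 rows/s is 45 s of embedding against a 30 s trigger.
print(keeps_up(9000, 200, 30))
```

Because the Auto Loader caps are expressed in files or bytes rather than rows, the row budget still has to be translated using the typical size of a landed file.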
Friction Point 3 — LanceDB Write and Index
LanceDB's append-only design is a feature, not a limitation. It enables immutable versioning and efficient columnar reads. It also means the assumptions you've built up about in-place updates don't apply here.
Writing to LanceDB in a streaming context is two distinct operations running at two different cadences.
The write happens on every trigger. foreachBatch generates embeddings and calls table.add(). Records land immediately and are queryable. Fast, lightweight, happens constantly.
The index is a separate operation. table.optimize() compacts fragment files and brings the ANN index up to date with newly written rows. This is expensive. It should never run on every write.
The gap between those two cadences is where flat scan comes in. Newly written records that haven't been indexed yet are served via brute-force flat scan. This is intentional LanceDB behavior — the index covers the stable corpus, flat scan covers the recent tail. Size your optimize cadence so the flat scan window stays within acceptable latency bounds for your use case.
Watch for concurrent write and optimize contention. Running both simultaneously on the same table will cause problems. Serialize them.
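When the writer and the optimizer live in the same process, serializing them can be as simple as a shared lock plus a write counter. The sketch below shows that shape. SerializedTable and its threshold are illustrative and not part of LanceDB, and FakeTable is an in-memory stand-in so the example runs without the library; the wrapper only assumes the wrapped object exposes add() and optimize().

```python
import threading


class SerializedTable:
    """Wraps a table so add() and optimize() never overlap."""

    def __init__(self, table, optimize_every: int = 50):
        self._table = table
        self._lock = threading.Lock()
        self._writes_since_optimize = 0
        self._optimize_every = optimize_every

    def add(self, rows) -> None:
        with self._lock:
            self._table.add(rows)
            self._writes_since_optimize += 1

    def maybe_optimize(self) -> bool:
        """Run optimize() only once enough writes have piled up."""
        with self._lock:
            if self._writes_since_optimize < self._optimize_every:
                return False
            self._table.optimize()
            self._writes_since_optimize = 0
            return True


class FakeTable:
    """In-memory stand-in so the sketch runs without LanceDB."""

    def __init__(self):
        self.rows, self.optimize_calls = [], 0

    def add(self, rows):
        self.rows.extend(rows)

    def optimize(self):
        self.optimize_calls += 1


fake = FakeTable()
guarded = SerializedTable(fake, optimize_every=3)
for i in range(3):
    guarded.add([i])
ran = guarded.maybe_optimize()
```

A threading.Lock only protects a single process. If optimize runs as its own scheduled job, the same idea needs an external coordinator, such as pausing the stream or a lock both jobs can see.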
The Proposed Solution
Understanding the friction points is half the battle. The other half is a set of deliberate architectural decisions that address each one without over-engineering the pipeline.
Replication boundary — set the contract upstream. Before any code gets written, define what NRT means in your specific environment. Get the replication cadence from whoever owns that tool. That number becomes your SLA floor. Document it. Put it in your architecture decision record.
Change event handling — explicit delete and rewrite. Build explicit logic at the Bronze layer to detect update events. A soft delete pattern works cleanly for most mid-market implementations — add an is_deleted flag and a record_version field at Bronze. At Gold, filter on the latest non-deleted version before embedding.
Auto Loader — configure it like infrastructure. Schema evolution mode on from day one. Event Grid notifications if your landing volume justifies it. Checkpoint path in a durable, monitored location.
foreachBatch — idempotency is not optional. Batch ID tracking, check before writing, re-raise exceptions. Three columns in a Delta table. Build it in before the first production write.
LanceDB — decouple write cadence from optimize cadence. Writes happen constantly. Optimize runs on its own schedule driven by observed fragment counts. Never run both concurrently. Let the data tell you when to tighten or loosen the cadence.
This is the first post in "NRT Vector Search: From Spike to Production." The rest of the series goes deeper on each layer — implementation detail, real numbers, and the decisions that don't show up in documentation.
Clarity through the chaos.