NRT Vector Search — Part 4: LanceDB in Production

LanceDB's documentation is clean and well written. It will tell you how to connect, create a table, add records, and run a similarity search. What it won't tell you is what happens when you do all of that continuously, at volume, with a streaming job writing on one cadence and users querying on another.

This post covers the operational reality — the append-only model, fragment accumulation, the two-part write pattern, and what optimize cadence actually looks like when real data is moving through the pipeline.

Unlearn Relational First

The first thing to get right about LanceDB is conceptual, not technical.

If you arrive with a relational mental model — tables, primary keys, upserts, in-place updates — you will immediately reach for operations that don't exist. There is no UPDATE. There is no native upsert. A record that needs to change is a delete and a re-add, in that order, explicitly.

This isn't a limitation. It's a deliberate design decision rooted in the Lance columnar format. Immutable versioning, efficient columnar reads, and the ability to time-travel across table states all depend on append-only semantics. The trade-off is that the relational mental model is actively wrong here and will cost you time if you carry it in.

Reframe early: you are managing embeddings and indexes, not rows and primary keys. Every decision downstream gets cleaner once that shift happens.

The Append-Only Reality in a Streaming Context

In a batch pipeline, append-only is straightforward. You build the index once, query it, rebuild when the corpus changes. Simple.

In a streaming pipeline it gets more interesting. Records are arriving continuously. foreachBatch is calling table.add() on every trigger — every 30 seconds, every minute, whatever your interval is. Each call appends a new fragment to the table.

LanceDB stores data in fragment files. A table with a long write history and no maintenance will accumulate many small fragment files. More fragments means more file handles, more metadata overhead, and degraded query performance as the scan touches more physical files to assemble a result.

Left unmanaged, fragment accumulation is a slow poison. Query latency creeps up. The pipeline keeps running. Nobody notices until the degradation is significant enough to surface in user complaints.

The Two-Part Write

Writing to LanceDB in a streaming context is two distinct operations that should never run on the same schedule.

Part 1 — The write. table.add() appends new records immediately. They are queryable the moment the write completes. This happens on every trigger. It is fast and lightweight by design.

Part 2 — The optimize. table.optimize() does two things: compaction merges accumulated fragment files into larger, more efficient ones, and index rebuild regenerates the ANN index over the compacted data. This is expensive. On a large table with significant fragment accumulation it can take meaningful time. It should never run on every write.

The mistake is coupling these two cadences. A streaming job that calls optimize() inside foreachBatch on every trigger will either slow to a crawl or fail under load. The write cadence and the optimize cadence are separate concerns and need to be managed separately.

Flat Scan: Feature, Not Bug

Records written after the last optimize() call are not yet in the ANN index. When a query arrives, LanceDB handles this automatically — it runs ANN search over the indexed portion of the table and flat scan over the unindexed tail, then merges the results.

Flat scan is brute-force vector comparison. It is O(n) against the unindexed records. For a small recent tail it is fast. For a large unindexed accumulation it degrades query latency meaningfully.

This behavior is intentional. LanceDB guarantees that newly written records are immediately queryable even before the index is rebuilt. The flat scan window is the price of that freshness guarantee. Understanding it means you can reason about the latency profile of your pipeline — ANN latency on the stable corpus, flat scan latency on the recent tail, both contributing to the query response you observe.

The practical implication: your optimize cadence directly controls the size of your flat scan window. Optimize frequently and the unindexed tail stays small. Let optimize fall behind and flat scan latency grows with it.

Decoupling Optimize Cadence

The right architecture separates the optimize operation entirely from the streaming write path.

One clean approach is a separate scheduled job — running on its own cluster and schedule, independent of the streaming pipeline. It connects to the same LanceDB table, calls optimize(), and exits. No shared execution context with the streaming job, no contention.

A coordination mechanism prevents concurrent write and optimize from running simultaneously on the same table:

def run_optimize_if_safe(table):
    if is_write_in_progress():
        log.info("Write in progress — skipping optimize")
        return

    try:
        set_optimize_lock()
        table.optimize()
        clear_optimize_lock()
    except Exception as e:
        log.error(f"Optimize failed: {e}")
        clear_optimize_lock()
        # non-fatal — next scheduled run will retry

The optimize job checks the flag, skips if a write is active, runs when clear. Optimize failures are non-fatal. The table remains fully queryable with a larger flat scan window. Log the failure, monitor it, but don't let it take down the streaming pipeline.

Calibrating the Optimize Schedule

The right optimize cadence depends on your write volume and your acceptable flat scan window. There is no universal answer.

The inputs you need before setting a schedule:

Fragment count growth rate — how many fragments accumulate per hour at your write volume. Instrument this from day one.

Optimize runtime — how long optimize() takes at your table size. This determines the minimum safe interval between runs.

Flat scan latency at N unindexed records — measure it at realistic volumes. This gives you a concrete latency target to optimize against.

Let those three numbers drive the schedule. The right cadence at 100K records is not the right cadence at 10M.

The Index Choice

IVF_HNSW_SQ is the right index choice for production RAG at scale. IVF clusters vectors into partitions to narrow the search space. HNSW navigates those partitions efficiently using a graph structure. SQ applies scalar quantization to compress vectors and reduce memory footprint.

Together they deliver fast approximate nearest neighbor search at memory-efficient scale. Start here if you're building for production. Don't start with flat index and plan to upgrade later — the migration cost at scale is not worth the simplicity savings early.

What to Monitor

Fragment count — track it over time. A rising trend without periodic drops indicates optimize isn't running or isn't keeping pace.

Optimize runtime — a growing trend indicates the table is accumulating more data between runs than optimize can efficiently handle.

Flat scan record count — records written since the last optimize. Direct driver of flat scan window size and query latency on the tail.

Query latency distribution — P95 and P99 are where the signal lives. Flat scan latency shows up in the tail of your distribution before it shows up in the average.

Next: Part 5 — The Full Pipeline End to End. Auto Loader to LanceDB wired together, real latency numbers, and what the tuning decisions look like when all three tools are running simultaneously.

Clarity through the chaos.

Arjun Krishnamoorthi is the founder of LogicLens LLC, a fractional data architecture and AI consulting practice. If you have a data infrastructure problem or an AI project that needs senior hands — let's talk.