This is Post 2 of The Production Ready Series.

"You do not lose the business sponsor conversation in the room. You lose it the day you ship an AI system without observability baked in. By the time they are asking hard questions it is already too late to build the answers."

Built as an Afterthought. Useless When It Matters.

Most AI systems get built in a predictable order. The model first. The pipeline next. The endpoint after that. And somewhere at the end, if there is time, maybe some logging. Maybe a dashboard. Something to show the system is running.

That sequence is a mistake.

When something breaks, and it will, you need answers. Not eventually. Immediately. The business sponsor is not interested in "we are investigating." They want to know what happened, which users were affected, how long it lasted, and what you are doing to make sure it never happens again.

If your event layer was an afterthought you cannot answer those questions. Not because you are not good enough. Because you architecturally guaranteed that outcome the day you deprioritized observability. You did not lose that conversation in the room. You lost it six months earlier.

The event taxonomy is not a monitoring feature. It is not a nice-to-have. It is the first-class architectural citizen that makes every other part of your AI system defensible in production. Build it that way from day one or pay for it later in a room you do not want to be in.

What First Class Actually Means

First class means the event layer is designed before the application is built — not bolted on after. It means every domain of system behavior has a dedicated table with intentional structure. It means the schema is typed consistently, named deliberately, and built to answer the questions the business will actually ask.

It also means the team treats it with the same rigor as any other production component. Schema reviews. Testing alignment. Version discipline. Not a notebook someone threw together on a Friday.

Here is what first class looks like in practice across six tables.

The Six Tables

ApplicationHealthEvents

This is the front door of your observability layer. Every endpoint called, every request that hit the application, latency recorded, errors captured. If a user had a bad experience this table tells you when it started, which endpoint handled it, how long it took, and whether the system threw an error or just returned a bad result silently.

This is the first table you query when the business asks what happened. It needs to be right. It needs to be complete. It needs to be there every single time a request hits the system — not sampled, not batched, not approximate.
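A minimal sketch of what "every single time" can look like in code, assuming a Python service. The field names, the record_request helper, and the in-memory sink are all hypothetical stand-ins, not the actual schema. The point is the try/finally: exactly one event is written per request, whether the handler succeeds or raises.

```python
import time
from dataclasses import dataclass
from datetime import date, datetime, timezone
from typing import Any, Callable, List, Optional


@dataclass(frozen=True)
class ApplicationHealthEvent:
    # Hypothetical columns; adapt to your own ApplicationHealthEvents schema.
    request_id: int
    endpoint: str
    event_ts: datetime
    latency_ms: int
    status_code: int
    error_message: Optional[str]
    ingest_date: date


def record_request(
    request_id: int,
    endpoint: str,
    handler: Callable[[], Any],
    sink: List[ApplicationHealthEvent],
) -> Any:
    """Run the handler and append exactly one event, on success or failure."""
    started = time.perf_counter()
    status_code, error = 200, None
    try:
        return handler()
    except Exception as exc:
        status_code, error = 500, repr(exc)
        raise  # the caller still sees the failure; we only observe it
    finally:
        now = datetime.now(timezone.utc)
        sink.append(
            ApplicationHealthEvent(
                request_id=request_id,
                endpoint=endpoint,
                event_ts=now,
                latency_ms=int((time.perf_counter() - started) * 1000),
                status_code=status_code,
                error_message=error,
                ingest_date=now.date(),
            )
        )
```

In a real system the sink would be a durable writer rather than a list. The shape of the guarantee is the same.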

DocProcessingEvents

Every document that enters the system leaves a trail. Ingestion timestamp. Processing steps. Parse attempts. What succeeded. What did not. How long each step took.

This table answers the question nobody thinks to ask until a document goes missing from search results. Where did it go. When did it enter the system. Did processing complete. Was there a silent failure nobody caught because there was no event record to catch it with.

Silent failures in document processing are a production AI system's most dangerous failure mode. They do not throw errors. They just produce missing or degraded results. DocProcessingEvents is how you see them.
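Here is one way that visibility could work, sketched under assumptions. The step names ("ingested", "indexed", "failed") and the doc_id and step keys are hypothetical. The idea is simply to find documents that entered the system and never reached a terminal step.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set

# Hypothetical step names; a terminal step means processing reached an end state.
TERMINAL_STEPS = {"indexed", "failed"}


def find_stalled_documents(events: Iterable[Dict]) -> List[int]:
    """Return doc_ids that were ingested but never reached a terminal step."""
    steps_by_doc: Dict[int, Set[str]] = defaultdict(set)
    for event in events:
        steps_by_doc[event["doc_id"]].add(event["step"])
    return sorted(
        doc_id
        for doc_id, steps in steps_by_doc.items()
        if "ingested" in steps and not (steps & TERMINAL_STEPS)
    )


events = [
    {"doc_id": 1, "step": "ingested"},
    {"doc_id": 1, "step": "parsed"},
    {"doc_id": 1, "step": "indexed"},
    {"doc_id": 2, "step": "ingested"},
    {"doc_id": 2, "step": "parsed"},  # no terminal step: a silent failure
    {"doc_id": 3, "step": "ingested"},
    {"doc_id": 3, "step": "failed"},  # a loud failure, already visible
]
stalled = find_stalled_documents(events)
```

Document 2 is the one nobody would have noticed. It never errored. It just stopped.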

PipelineEvents

Databricks gives you a lot natively. Job run history, task durations, retry counts: the UI covers most of what you need for day-to-day batch operations. But structured, queryable pipeline telemetry in your own layer gives you something the UI does not: the ability to correlate pipeline execution with application behavior.

PipelineEvents is also where batch_id lives for near-real-time (NRT) workloads. That identifier is the thread that connects a foreachBatch micro-batch to everything that happened downstream of it. Without it you have two separate stories (the pipeline ran, LanceDB responded) that you cannot connect. batch_id makes that connection explicit, queryable, and permanent.
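To make that concrete, here is a small sketch using an in-memory SQLite database as a stand-in for the real tables. The table and column names (pipeline_events, downstream_events, component, outcome) are assumptions for illustration. In Structured Streaming, the function passed to foreachBatch receives the micro-batch DataFrame and a batch id, and that id is what would be stamped on both sides.

```python
import sqlite3

# In-memory stand-in for the event tables; all names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE pipeline_events (batch_id INTEGER, status TEXT, row_count INTEGER);
    CREATE TABLE downstream_events (batch_id INTEGER, component TEXT, outcome TEXT);
    INSERT INTO pipeline_events VALUES (101, 'succeeded', 500);
    INSERT INTO pipeline_events VALUES (102, 'succeeded', 480);
    INSERT INTO downstream_events VALUES (101, 'lancedb_write', 'ok');
    """
)


def batches_without_downstream(connection: sqlite3.Connection) -> list:
    """Batches the pipeline reported, with no matching downstream record."""
    rows = connection.execute(
        """
        SELECT p.batch_id
        FROM pipeline_events AS p
        LEFT JOIN downstream_events AS d ON d.batch_id = p.batch_id
        WHERE d.batch_id IS NULL
        ORDER BY p.batch_id
        """
    ).fetchall()
    return [row[0] for row in rows]


orphans = batches_without_downstream(conn)
```

Batch 102 ran and reported success, and nothing downstream ever recorded it. Without the shared identifier that gap is invisible.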

OperationalMetrics

Not every operational signal fits a rigid schema. LanceDB index state. Blob existence checks. Last capture timestamps. Infrastructure health signals that evolve as the system matures.

OperationalMetrics handles this with a flexible key/value pattern. Metric name in one column. Numeric value in another. New signals get added without a schema change. The table grows with the system rather than fighting it.

This is intentional design for a layer that is still maturing. Flexibility now. Structure where it matters. Rigidity only where the schema is truly stable.
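A rough sketch of the key/value pattern, again with SQLite standing in for the real store. The column names and the example metric names are assumptions. Note that adding a new signal is just a new row with a new metric name.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE operational_metrics (
        metric_name  TEXT NOT NULL,
        metric_value REAL NOT NULL,
        recorded_ts  TEXT NOT NULL  -- ISO-8601 string, so text order is time order
    )
    """
)


def record_metric(connection, name: str, value: float, recorded_ts: str) -> None:
    """Append one metric reading; no schema change is needed for a new name."""
    connection.execute(
        "INSERT INTO operational_metrics VALUES (?, ?, ?)",
        (name, value, recorded_ts),
    )


def latest_metrics(connection) -> dict:
    """Return the most recent value for each metric name."""
    rows = connection.execute(
        """
        SELECT m.metric_name, m.metric_value
        FROM operational_metrics AS m
        JOIN (
            SELECT metric_name, MAX(recorded_ts) AS last_ts
            FROM operational_metrics
            GROUP BY metric_name
        ) AS last
          ON last.metric_name = m.metric_name AND last.last_ts = m.recorded_ts
        """
    ).fetchall()
    return dict(rows)


# Hypothetical metric names for illustration.
record_metric(conn, "lancedb_index_rows", 1000, "2024-01-01T00:00:00")
record_metric(conn, "lancedb_index_rows", 1200, "2024-01-01T00:05:00")
record_metric(conn, "blob_exists", 1, "2024-01-01T00:00:00")
snapshot = latest_metrics(conn)
```

The tradeoff is that every value has to be numeric and every consumer has to know the metric names. That is acceptable for a layer that is still finding its shape.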

UserAdvancedSearchEvents

What are users actually asking the system to do. Which queries. Which filters. Which search patterns. At what volume and frequency.

This table is operationally undervalued until the moment it is not. When retrieval quality degrades, when certain query patterns return poor results, when usage spikes in unexpected ways — UserAdvancedSearchEvents is where you start. It connects user behavior to system behavior in a way no infrastructure table can.

It also tells you whether the AI system is actually being used the way it was designed to be used. That is a conversation worth having early and often with the business.

HeartbeatEvents

The embarrassing one. The table that should have been in the MVP schema and was not. A regular structured pulse from each service confirming it is running, healthy, and reachable.

No transactions. No user events. Just a timestamp, a service identifier, and a status. Simple enough that it feels beneath the architecture. Critical enough that without it you cannot answer the most basic question a business sponsor can ask — was the system even up when this happened.

It took a schema review to surface it. It goes in v1. First. Before anything else.
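A minimal sketch of answering "was the system even up" from heartbeat records. The field names (service, ts, status), the "healthy" status value, and the two-minute tolerance are all assumptions. The tolerance should be tuned to however often your services actually pulse.

```python
from datetime import datetime, timedelta
from typing import Dict, Iterable


def was_up(
    heartbeats: Iterable[Dict],
    service: str,
    at: datetime,
    max_gap: timedelta = timedelta(minutes=2),
) -> bool:
    """True if the service sent a healthy heartbeat within max_gap before `at`."""
    return any(
        hb["service"] == service
        and hb["status"] == "healthy"
        and timedelta(0) <= at - hb["ts"] <= max_gap
        for hb in heartbeats
    )


heartbeats = [
    {"service": "search-api", "ts": datetime(2024, 1, 1, 12, 0), "status": "healthy"},
    {"service": "search-api", "ts": datetime(2024, 1, 1, 12, 1), "status": "healthy"},
    # Nothing after 12:01: the pulse stopped.
]
up_during_incident = was_up(heartbeats, "search-api", datetime(2024, 1, 1, 12, 2))
up_ten_minutes_later = was_up(heartbeats, "search-api", datetime(2024, 1, 1, 12, 11))
```

The absence of a record is the signal. That only works if the record is reliably there when things are fine.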

The Schema Discipline That Makes It Hold

Consistent typing. Every table speaks the same language. Bigints for identity columns — these systems sit alongside relational source systems with SQL Server lineage and the types need to be compatible end to end. Timestamps are timestamps. Strings are strings. No surprises when you join across tables.

Naming conventions. The Events suffix is intentional (_events if the physical tables use snake_case). Every table in the taxonomy follows it except OperationalMetrics, which is structurally different by design. Six months from now, when someone new is maintaining this layer, the naming tells the story without documentation.

ingest_date on every table. Stamped at ingestion. Costs nothing. Gives you a reliable timeline anchor for every record in the system. Small discipline. Enormous value when you need to reconstruct a sequence of events across multiple tables.

Null checks and pipeline stops. Data quality gates baked into the pipeline logic. If thresholds are not met the pipeline stops. Not silently. Not with a warning. It stops. Because a metrics layer fed with bad data is worse than no metrics layer at all — it gives you false confidence at exactly the moment you need accurate answers.
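The last two disciplines can be sketched in a few lines of plain Python. Everything here is illustrative: DataQualityError, the function names, and the zero-tolerance default threshold are assumptions, and a real pipeline would apply the same idea to DataFrames rather than lists of dicts.

```python
from datetime import date
from typing import Dict, Iterable, List, Optional


class DataQualityError(RuntimeError):
    """Raised to stop the pipeline when a quality threshold is not met."""


def enforce_null_threshold(
    rows: List[Dict],
    required_columns: Iterable[str],
    max_null_ratio: float = 0.0,
) -> None:
    """Raise, rather than warn, if any required column has too many nulls."""
    if not rows:
        return  # nothing to check in an empty batch
    for column in required_columns:
        nulls = sum(1 for row in rows if row.get(column) is None)
        ratio = nulls / len(rows)
        if ratio > max_null_ratio:
            raise DataQualityError(
                f"{column}: null ratio {ratio:.2%} exceeds {max_null_ratio:.2%}"
            )


def stamp_ingest_date(rows: List[Dict], today: Optional[date] = None) -> List[Dict]:
    """Return copies of the rows with ingest_date set at ingestion time."""
    stamp = today or date.today()
    return [{**row, "ingest_date": stamp} for row in rows]


def ingest(rows: List[Dict], required_columns: Iterable[str]) -> List[Dict]:
    """Gate first, then stamp. A failed gate means nothing is written."""
    enforce_null_threshold(rows, required_columns)
    return stamp_ingest_date(rows)
```

Raising an exception is the whole design. A logged warning lets bad rows through, and bad rows are exactly what you cannot afford in the layer you will be quoting from.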

The Schema Review Is Not Optional

A schema that looks right on a whiteboard and a schema that holds up in production are two different things.

The review process is where the gaps surface. Two engineers with different mental models of what the system needs to capture will produce two different schemas. Neither is complete. The review is where they become one.

It is also where the right people get aligned before the layer gets built out further. Testing needs to understand the schema before they write verification logic against it. Not after. The cost of misalignment discovered late is a rewrite nobody has time for.

The HeartbeatEvents table came out of a review. Not a brilliant design insight. A methodical process of asking — what are we missing, what does production accountability actually require, what question can we not currently answer that we should be able to.

Run that process. Run it again when the system matures. The schema conversation is not a one-time event.

First Class or Useless. There Is No Middle.

An event layer built as an afterthought gives you the feeling of observability without the substance of it. Tables that exist but are not complete. Coverage that looks right until you actually need it. A metrics layer you cannot trust at the exact moment trust matters most.

The business sponsor conversation is coming. Something will break. Questions will be asked. The only variable is whether you can answer them.

Build the event taxonomy first class from day one. Type it consistently. Name it deliberately. Review it before you ship. Add HeartbeatEvents before you call it v1.

Do that and when the hard questions come you will have hard answers ready.

That is what production ready actually means.

What's Next in This Series

Post 3 — Two Engineers. One Schema. No Shortcuts.: How two engineers align on a shared data contract and why the process matters as much as the output
Post 4 — What Databricks Gives You and What It Doesn't: What Databricks gives you natively and where you still need to build
Post 5 — Follow the Request: Walking a request chain from endpoint to vector store and back
Post 6 — Closing the Loop: Feeding clean signal to your observability platform and making the business conversation possible