feat(ai-bot): add a flag-gated, provider-neutral answer cascade with a reserved-USD ledger and request telemetry

2026-06-01 18:06:15 +03:00 · 2026-06-01 18:06:15 +03:00 · ff8918dae1
commit ff8918dae1
parent f7f6984d18
24 changed files with 2970 additions and 313 deletions
--- a/apps/ai-bot/README.md
+++ b/apps/ai-bot/README.md
@ -32,12 +32,20 @@ apps/ai-bot/
 ├── registration.go # generate + read registration.yaml (tokens, mautrix idiom)
 ├── events.go       # Matrix event types + decoders
 ├── mentions.go     # m.mentions + pill/reply fallbacks (F29/F30)
-├── context.go      # xAI message-window assembly (trigger + bot replies)
+├── context.go      # provider-neutral message-window assembly (trigger + bot replies)
-├── xai.go          # chat/completions client + retry (F6)
+├── llm.go          # provider-neutral types + LLMClient interface (no vendor names)
-├── store.go        # Postgres (vojo_ai): spend ledger, txn/event dedup, encrypted-warned set
+├── httpllm.go      # shared OpenAI-compatible chat/completions transport + retry (F6)
-├── messages.go     # bot-authored RU notices
+├── provider_xai.go # thin xAI/Grok adapter over the shared transport
 ├── provider_gemini.go # Gemini adapter: OpenAI-compat client + native v1beta grounding
 ├── pricing.go      # per-model price table (priceFor) + CostBreakdown
 ├── router.go       # cascade router: Layer-0 heuristic + optional Layer-1 Gemini classifier
 ├── cascade.go      # generate(): route dispatch with degrade-to-grok_direct
 ├── web.go          # WebProvider: grok_web_search (Live Search) | gemini_grounding + cap guard
 ├── telemetry.go    # request_log analytics row + async emit + retention trim
 ├── store.go        # Postgres (vojo_ai): spend ledger (+reservation/components), dedup, request_log, grounding cap
 ├── messages.go     # language-free emoji status reactions
 ├── markdown.go     # markdown → org.matrix.custom.html for the reply's formatted_body
-├── util.go         # bounded dedup set
+├── util.go         # bounded dedup set + small hash
 ├── prompts/system_ru.txt
 ├── Dockerfile      # CGO-free static build → distroless, EXPOSE 8009
 └── .env.example
@ -49,8 +57,15 @@ All via environment (see `.env.example`). Required: `HOMESERVER_URL`, `BOT_MXID`
 `AS_TOKEN`, `HS_TOKEN`, `XAI_API_KEY`, `ALLOWED_SERVERS`, `AI_BOT_DATABASE_URL`.
 `AS_ADDR` (default `:8009`) is the transaction-push listen address — it must match
 the `url` port in the registration. The model is env-configurable (`XAI_MODEL`,
-default `grok-4.20-0309-non-reasoning`; `grok-4.3` is an alternative — **re-verify
+default `grok-4.20-0309-non-reasoning`).
-the id + price on docs.x.ai before deploy**).
+
 `grok-4.3` is the newer unified model (same price, 1M context): one model with a
 `reasoning_effort` dial. If you switch `XAI_MODEL=grok-4.3`, set
 `GROK_REASONING_EFFORT=none` to keep the default voice fast/cheap — otherwise the API
 defaults to `low` and reasons on **every** reply. `GROK_REASONING_EFFORT` (accepted:
 `none|low|medium|high`, default empty = not sent) is applied to the normal Grok voice
 (grok_direct + web synthesis); leave it **empty** for `grok-4.20-non-reasoning`, which
 rejects the param. The reason_then_grok route always uses `high` regardless.
 ### Database
@ -78,8 +93,49 @@ AI_BOT_DATABASE_URL=postgres://vojo_ai:<secret>@postgres:5432/vojo_ai?sslmode=di
 ```
 The hard USD ceiling is priced from the **API-returned token usage** times the
-configured `XAI_PRICE_*_PER_M` fallbacks, so a price change only needs those
+per-model price table (`XAI_PRICE_*_PER_M`, `GEMINI_PRICE_*_PER_M`), so a price
-constants updated — it can't silently blow the cap.
+change only needs those constants updated — it can't silently blow the cap. The
 ceiling is enforced with an optimistic **reservation** (`reserved_usd`): a request's
 estimated max-cost is booked at admission and settled to the real cost afterward, so
 a burst of concurrent requests can't slip past `DAILY_USD_CEILING` (it would
 otherwise, since the USD only lands after each call).
 ### Operator accounting (Phase 1, on by default)
 - `REQUEST_BUDGET_SECONDS` (default 180) — overall per-request deadline shared by all
  model calls, so a slow/retried call (or a cascade) can't accrete minutes.
 - `GROK_PROMPT_CACHE` (default false) — Grok caches prompt prefixes automatically; this
  toggle only adds the `x-grok-conv-id` routing header (a per-room id) to raise the
  cache hit rate. There is no `prompt_cache` body param (verified on docs.x.ai).
 - `TELEMETRY_ENABLED` (default false) — write a `request_log` analytics row per engaged
  request (route, per-component $, latency, degrade/ceiling reasons). The write is async
  and isolated — its failure never drops a reply. `TELEMETRY_STORE_TEXT` (default false)
  additionally keeps the query text (for offline eval); `TELEMETRY_RETENTION_DAYS`
  (default 30) time-trims old rows. Turn telemetry on to MEASURE the base before enabling
  any cascade layer.
 ### Cascade (Phase 2-4) — behind flags, **default OFF** (every layer off == today's bot)
 All optional; an unset env is exactly today's single grok_direct call. Any layer off or
 failing **degrades to grok_direct** (never silence). Do **not** enable in prod until the
 offline-eval gate (misroute < 2-3% AND measured saving > the second provider's cost; see
 `docs/plans/ai_backend_build_plan.md` §9).
 | Env | Default | Meaning |
 |---|---|---|
 | `ROUTER_ENABLED` | false | Layer-0 heuristic router (else everything → grok_direct) |
 | `ROUTER_CLASSIFIER_ENABLED` | false | Layer-1 Gemini classifier on uncertain cases (requires `ROUTER_ENABLED` + Gemini key) |
 | `TRIVIAL_OFFLOAD_ENABLED` | false | answer trivial messages with Gemini (requires Gemini key) |
 | `WEB_ENABLED` | false | web_then_grok route (Gemini/Grok fetches fresh facts, **Grok stays the voice**) |
 | `WEB_PROVIDER` | `grok_web_search` | `grok_web_search` (xAI Agent Tools `web_search` on the Responses API, $5/1k calls, no Gemini key) or `gemini_grounding` (**cheapest**: Gemini does the fetch via native v1beta `google_search`, Grok voices it — ~$0.0013/query, validated on `gemini-2.5-flash-lite`; the F-EXT-3 "Gemini-3 only" caveat is the OpenAI-compat endpoint, native v1beta works on 2.5). Requires `GEMINI_API_KEY`. |
 | `WEB_GROUNDING_DAILY_CAP` | 450 | durable per-day cap for `gemini_grounding` before degrading (keep < the 500/day free grounding RPD; guards the per-1k overage) |
 | `REASONING_ENABLED` | false | manual "think harder" route on `REASONING_TRIGGER` |
 | `REASONING_TRIGGER` | `подумай глубже` | trigger phrase |
 | `REASONING_MODEL` | `grok-4.3` | a **reasoning-capable** model (the default `grok-4.20-non-reasoning` rejects `reasoning_effort`) |
 | `REASONING_EFFORT` | `high` | the reasoning_effort the "think harder" route sends (`none｜low｜medium｜high`) |
 | `GEMINI_API_KEY` / `_FILE` | — | required only when a Gemini-using layer is on (fail-fast at startup otherwise) |
 | `GEMINI_MODEL` | `gemini-2.5-flash-lite` | cheap model for trivial/classifier |
 | `GEMINI_BASE_URL` | `…/v1beta/openai` | OpenAI-compat endpoint (native grounding endpoint derived from it) |
 ## One-time setup (appservice registration)
--- a/apps/ai-bot/appservice_test.go
+++ b/apps/ai-bot/appservice_test.go
@ -32,7 +32,7 @@ func openTestStore(t *testing.T) *Store {
 	}
 	ctx, cancel := opContext()
 	defer cancel()
-	if _, err := st.pool.Exec(ctx, `TRUNCATE processed_txn, processed_event, spend, warned_encrypted`); err != nil {
+	if _, err := st.pool.Exec(ctx, `TRUNCATE processed_txn, processed_event, spend, warned_encrypted, request_log, grounding_count`); err != nil {
 		st.Close()
 		t.Fatalf("truncate test tables: %v", err)
 	}
--- a/apps/ai-bot/bot.go
+++ b/apps/ai-bot/bot.go
@ -5,6 +5,7 @@ import (
 	"fmt"
 	"log/slog"
 	"sync"
 	"sync/atomic"
 	"time"
 )
@ -30,9 +31,22 @@ type Bot struct {
 	cfg *Config
 	log *slog.Logger
 	mx  *MatrixClient
-	xai *XAIClient
+	llm LLMClient
 	st  *Store
 	// gemini is the cheap chat backend for the trivial route and the Layer-1 classifier
 	// (an LLMClient so tests can fake it); nil unless a layer that uses it is enabled.
 	// web is the web-freshness provider, built only when WEB_ENABLED. Both nil → the
 	// cascade can only ever produce grok_direct.
 	gemini LLMClient
 	web    WebProvider
 	// promptVersion is a short stable hash of the system prompt, logged with each
 	// request so prompt changes are visible in the analytics (A/B + regressions).
 	promptVersion string
 	// telemetryWrites paces the retention trim (every telemetryTrimEvery writes).
 	telemetryWrites atomic.Uint64
 	// mu guards the in-memory maps/sets below. Each transaction is acked to Synapse
 	// immediately (appservice.go) and its events are processed in transaction order,
 	// but the slow xAI generation runs in a per-room goroutine and the lazy probes run
@ -49,7 +63,7 @@ type Bot struct {
 func NewBot(ctx context.Context, cfg *Config, logger *slog.Logger) (*Bot, error) {
 	mx := NewMatrixClient(cfg.HomeserverURL, cfg.ASToken, cfg.BotMXID)
-	xai := NewXAIClient(cfg.XAIBaseURL, cfg.XAIAPIKey, logger)
+	llm := NewXAIClient(cfg.XAIBaseURL, cfg.XAIAPIKey, logger)
 	st, err := OpenStore(cfg.DatabaseURL)
 	if err != nil {
@ -57,16 +71,35 @@ func NewBot(ctx context.Context, cfg *Config, logger *slog.Logger) (*Bot, error)
 	}
 	b := &Bot{
-		cfg:      cfg,
+		cfg:           cfg,
-		log:      logger,
+		log:           logger,
-		mx:       mx,
+		mx:            mx,
-		xai:      xai,
+		llm:           llm,
-		st:       st,
+		st:            st,
-		seen:     newLRUSet(5000),
+		promptVersion: fmt.Sprintf("%08x", hashString(cfg.SystemPrompt)),
-		botSent:  newLRUSet(5000),
+		seen:          newLRUSet(5000),
-		meta:     make(map[string]*roomMeta),
+		botSent:       newLRUSet(5000),
-		buf:      make(map[string][]bufferedMsg),
+		meta:          make(map[string]*roomMeta),
-		inflight: make(map[string]bool),
+		buf:           make(map[string][]bufferedMsg),
 		inflight:      make(map[string]bool),
 	}
 	// Build the cascade backends only for enabled layers (config already fail-fast
 	// validated that the keys exist). With every cascade flag off these stay nil and
 	// generate() can only produce grok_direct — today's bot. The grounding web provider
 	// needs the concrete client (for the native generateContent call), so keep a typed
 	// handle alongside the LLMClient face.
 	var gc *geminiClient
 	if cfg.needsGemini() {
 		gc = NewGeminiClient(cfg.GeminiBaseURL, cfg.GeminiAPIKey, cfg.GeminiModel, logger)
 		b.gemini = gc
 	}
 	if cfg.WebEnabled {
 		if cfg.WebProvider == webProviderGeminiGrounding {
 			b.web = &geminiGrounding{gem: gc, st: st, cfg: cfg}
 		} else {
 			b.web = newGrokWebSearch(cfg, logger)
 		}
 	}
 	// Confirm the as_token + user_id resolves to BOT_MXID before serving.
@ -223,7 +256,11 @@ func (b *Bot) handleMessage(ctx context.Context, ev *Event) {
 	// bot can't read it. The probe runs without the lock.
 	if b.ensureEncryption(ctx, roomID) {
 		b.log.Debug("skip: encrypted room", "room", roomID)
-		b.reactEncryptedOnce(ctx, roomID, ev.EventID)
+		// Log the skip only when we actually react (once per room), so an encrypted room
 		// the bot can't read doesn't flood request_log with one row per message.
 		if b.reactEncryptedOnce(ctx, roomID, ev.EventID) {
 			b.recordSkip(ev, degradeEncrypted)
 		}
 		return
 	}
@ -255,6 +292,7 @@ func (b *Bot) handleMessage(ctx context.Context, ev *Event) {
 	// "leak" the bot into) a federated room with non-consenting third parties.
 	if foreign {
 		b.leaveForeign(ctx, roomID)
 		b.recordSkip(ev, degradeForeign)
 		return
 	}
@ -282,6 +320,7 @@ func (b *Bot) handleMessage(ctx context.Context, ev *Event) {
 	if isMedia {
 		b.log.Debug("skip: non-text msgtype (reacted)", "room", roomID, "sender", ev.Sender, "msgtype", mc.MsgType)
 		b.react(ctx, roomID, ev.EventID, reactMedia)
 		b.recordSkip(ev, degradeMedia)
 		return
 	}
@ -312,20 +351,46 @@ func (b *Bot) handleMessage(ctx context.Context, ev *Event) {
 const unlimitedCap = 1 << 30
 func (b *Bot) respond(ctx context.Context, roomID string, isDM bool, ev *Event, mc *MessageContent, history []bufferedMsg) {
 	started := time.Now()
 	// One telemetry row per request, populated as the flow decides its outcome and
 	// emitted once via defer — so every exit (deny, error, empty, paid silence, success)
 	// is recorded without scattering writes (F-FUNC-5). It starts as route=none/ok=false;
 	// proceeding to the model sets the route, success sets ok=true.
 	rl := RequestLog{
 		ID: ev.EventID, RoomID: roomID, Sender: ev.Sender,
 		Route: routeNone, RouterSource: "default",
 		PromptVersion: b.promptVersion,
 		QueryText:     mc.Body,
 		Models:        map[string]string{"final": b.cfg.XAIModel},
 	}
 	defer func() {
 		rl.LatencyMS = int(time.Since(started).Milliseconds())
 		b.recordTelemetry(rl)
 	}()
 	perUserCap := b.cfg.PerUserDailyCap
 	perUserUSD := b.cfg.PerUserDailyUSD
 	if b.cfg.UnlimitedUsers[ev.Sender] {
 		perUserCap = unlimitedCap
 		perUserUSD = 0 // exempt from both per-user gates; the global ceiling still applies
 	}
-	switch res, err := b.st.Reserve(ev.Sender, perUserCap, b.cfg.DailyUSDCeiling); {
+	// Reserve the route's estimated max-cost (not $0) so the global ceiling counts
 	// this in-flight call BEFORE it returns — the TOCTOU fix (§8.1). The envelope covers
 	// the most expensive ENABLED route, so whichever the router picks is admitted within
 	// the reservation; with the cascade off it is exactly grok_direct's estimate.
 	estimate := b.reserveEstimate()
 	switch res, err := b.st.Reserve(ev.Sender, perUserCap, perUserUSD, b.cfg.DailyUSDCeiling, estimate); {
 	case err != nil:
 		// A limiter failure is on our side — don't leave the user wondering.
 		b.log.Error("limiter reserve failed", "sender", ev.Sender, "err", err)
 		rl.Degraded, rl.Err = degradeReserveErr, err.Error()
 		b.react(ctx, roomID, ev.EventID, reactError)
 		return
 	case res == reserveDeniedUser:
 		// Per-user cap (anti-abuse, F24): stop answering, but always signal the limit —
 		// no message addressed to the bot is left without feedback.
 		b.log.Info("per-user daily cap reached; reacting", "sender", ev.Sender)
 		rl.PerUserCapHit = true
 		b.react(ctx, roomID, ev.EventID, reactRateLimit)
 		return
 	case res == reserveDeniedGlobal:
@ -333,23 +398,80 @@ func (b *Bot) respond(ctx context.Context, roomID string, isDM bool, ev *Event,
 		// once-per-day text notice), so signal every affected message rather than
 		// going silent after the first.
 		b.log.Warn("global daily USD ceiling reached", "room", roomID, "sender", ev.Sender)
 		rl.CeilingHit = true
 		b.react(ctx, roomID, ev.EventID, reactRateLimit)
 		return
 	}
 	// Past admission, a reservation + request slot are held. Guarantee they're freed on
 	// ANY exit that didn't settle — including a panic in generate() (recovered by safego)
 	// — so a leaked reservation can never permanently drift the global ceiling down
 	// (§8.1c). The normal paths set settled=true at Settle, so this defer then no-ops; it
 	// fires only on the panic/unexpected-return path, where it also reacts so the failure
 	// isn't silent.
 	settled := false
 	defer func() {
 		if settled {
 			return
 		}
 		rl.Degraded, rl.Err = "panic", "generation panicked or returned without settling"
 		if rerr := b.st.ReleaseReservation(ev.Sender, estimate); rerr != nil {
 			b.log.Error("release reservation (unsettled) failed", "sender", ev.Sender, "err", rerr)
 		}
 		if rerr := b.st.RefundRequest(ev.Sender); rerr != nil {
 			b.log.Error("refund (unsettled) failed", "sender", ev.Sender, "err", rerr)
 		}
 		b.react(ctx, roomID, ev.EventID, reactError)
 	}()
 	// Show "Vojo AI печатает…" for the whole generation. The keepalive refreshes the
 	// typing notification every 20s (the server expires it after 30s) so the indicator
 	// never lapses on a slow/retried answer, and the deferred stop clears it on exit.
 	stopTyping := b.startTypingKeepalive(ctx, roomID)
 	defer stopTyping()
-	msgs := buildContext(b.cfg.SystemPrompt, history, isDM, mc.Body, b.cfg.MaxCtxEvent, 8000)
+	// Overall per-request deadline (§8.2.2): every model call in the cascade shares this
-	resp, err := b.xai.Complete(ctx, b.cfg.XAIModel, msgs, b.cfg.MaxOutTok, b.cfg.XAITemp)
+	// single budget (genCtx), so a multi-stage route can't accrete minutes the way
 	// per-stage 3×60s retries would. react/send/store ops use the live room ctx, NOT
 	// genCtx, so a budget timeout still surfaces as a ⚠️ react, never silence.
 	genCtx, cancel := context.WithTimeout(ctx, b.cfg.RequestBudget)
 	defer cancel()
 	msgs := buildContext(b.cfg.SystemPrompt, history, isDM, mc.Body, b.cfg.MaxCtxEvent, maxPromptTokens)
 	res, err := b.generate(genCtx, mc.Body, msgs, b.convID(roomID))
 	// Record what the routing + generation actually did, whatever the outcome.
 	rl.Route = res.route
 	rl.RouterSource = res.decision.Source
 	rl.RouterConfidence = res.decision.Confidence
 	rl.FallbackFired = res.fallback
 	rl.Escalated = res.route == routeReason
 	rl.Cost = res.cost
 	if res.stageMS != nil {
 		rl.StageMS = res.stageMS
 	}
 	if res.finalModel != "" {
 		rl.Models["final"] = res.finalModel
 	}
 	if res.decision.Source == "classifier" {
 		rl.Models["router"] = b.cfg.GeminiModel
 	}
 	if res.degraded != "" {
 		rl.Degraded = res.degraded
 	}
 	if err != nil {
-		// at-most-once already retried transient failures inside Complete; refund the
+		// Terminal: even grok_direct failed. Settle whatever the cascade ACTUALLY spent
-		// reserved request so an xAI outage doesn't burn the user's daily cap, and
+		// (e.g. a paid web fetch before the failure) and release the rest of the
-		// signal the failure (react → no anti-loop, no language).
+		// reservation in one step, then refund the request slot so an outage doesn't burn
-		b.log.Error("xai completion failed", "sender", ev.Sender, "err", err)
+		// the cap, and react (never silent). Settle with an all-zero cost is just a
 		// release, so a pure grok_direct failure books nothing — exactly as before.
 		b.log.Error("generation failed", "sender", ev.Sender, "route", res.route, "err", err)
 		rl.Err = err.Error()
 		if serr := b.st.Settle(ev.Sender, estimate, res.cost); serr != nil {
 			b.log.Error("settle (failed request) failed", "sender", ev.Sender, "err", serr)
 		}
 		settled = true
 		if rerr := b.st.RefundRequest(ev.Sender); rerr != nil {
 			b.log.Error("refund failed", "sender", ev.Sender, "err", rerr)
 		}
@ -357,39 +479,86 @@ func (b *Bot) respond(ctx context.Context, roomID string, isDM bool, ev *Event,
 		return
 	}
-	// A 2xx from xAI is billed even if the text came back empty — always book the real
+	// Success from some route. Settle: release the reservation and book the real
-	// cost so both caps see it (an empty 200 must not bypass the per-user cap and the
+	// per-component cost, so both caps see grounding/tool fees too — not just tokens.
-	// global ceiling).
+	if err := b.st.Settle(ev.Sender, estimate, res.cost); err != nil {
-	usd := computeUSD(resp.Usage, b.cfg)
+		b.log.Error("settle spend failed", "sender", ev.Sender, "err", err)
 	if err := b.st.Reconcile(ev.Sender, usd); err != nil {
 		b.log.Error("reconcile spend failed", "sender", ev.Sender, "err", err)
 	}
 	settled = true
 	rl.PromptTokens, rl.CachedTokens, rl.CompletionTokens =
 		res.usage.PromptTokens, res.usage.CachedTokens, res.usage.CompletionTokens
 	rl.CacheHit = res.usage.CachedTokens > 0
 	rl.ProviderRequestID = res.providerID
-	text := resp.Text()
+	text := res.text
 	if text == "" {
 		// Billed but no usable text (content filter / length cap / empty choices). Never
-		// leave a billed request without feedback — react "couldn't answer".
+		// leave a billed request without feedback — react "couldn't answer". The slot
-		b.log.Warn("xai returned empty completion (billed, reacting)", "sender", ev.Sender, "usd", usd)
+		// stays consumed (the 2xx was real); no refund, or an empty reply could be forced
 		// to dodge the cap.
 		b.log.Warn("empty completion (billed, reacting)", "sender", ev.Sender, "usd", res.cost.Total())
 		rl.Degraded = degradeEmpty
 		b.react(ctx, roomID, ev.EventID, reactError)
 		return
 	}
-	b.log.Info("answered", "room", roomID, "sender", ev.Sender, "dm", isDM,
+	b.log.Info("answered", "room", roomID, "sender", ev.Sender, "dm", isDM, "route", res.route,
-		"usd", usd, "prompt_tokens", resp.Usage.PromptTokens, "completion_tokens", resp.Usage.CompletionTokens)
+		"usd", res.cost.Total(), "prompt_tokens", res.usage.PromptTokens, "completion_tokens", res.usage.CompletionTokens)
-	b.sendReply(ctx, roomID, ev, mc, text)
+	if err := b.sendReply(ctx, roomID, ev, mc, text); err != nil {
 		// Paid silence (§8.1): the spend is real (USD is kept — refunding it would
 		// under-count the ceiling), but the reply never landed. Refund the request SLOT
 		// so the user can retry, and react ⚠️ so the failure isn't silent.
 		b.log.Error("send reply failed after billing; refunding slot + reacting", "sender", ev.Sender, "err", err)
 		rl.Degraded, rl.Err = degradeSendFailed, err.Error()
 		if rerr := b.st.RefundRequest(ev.Sender); rerr != nil {
 			b.log.Error("refund failed", "sender", ev.Sender, "err", rerr)
 		}
 		b.react(ctx, roomID, ev.EventID, reactError)
 		return
 	}
 	rl.OK = true
 }
-// computeUSD prices the call from the API-returned token usage (authoritative
+// maxPromptTokens bounds the assembled prompt (history is trimmed to fit) and feeds
-// counts) and the configured per-1M prices — so the hard ceiling tracks real
+// the reservation estimate, so the two never disagree about a request's size.
-// usage even if the model/price changes (only the constants need updating).
+const maxPromptTokens = 8000
-func computeUSD(u xaiUsage, cfg *Config) float64 {
+
-	cached := u.PromptTokensDetails.CachedTokens
+// estimateUSD is the conservative max-cost reserved for a route before the call, so
-	nonCached := u.PromptTokens - cached
+// the global ceiling can count an in-flight request (§8.1). It prices a full prompt
 // (maxPromptTokens) plus the max output at the model's non-cached rates — an upper-ish
 // bound, since real calls send fewer tokens and get the cheaper cached rate. Settle
 // later books the authoritative actual cost regardless, so a slightly-off estimate
 // only nudges admission, never the final accounting.
 func (b *Bot) estimateUSD(model string) float64 {
 	p := b.cfg.priceFor(model)
 	return float64(maxPromptTokens)/1e6*p.InputPerM + float64(b.cfg.MaxOutTok)/1e6*p.OutputPerM
 }
 // convID returns the prompt-cache routing hint sent as x-grok-conv-id, or "" when
 // GROK_PROMPT_CACHE is off. Grok caches prompt prefixes automatically; the header
 // only pins a conversation to the same backend to raise the hit rate (docs.x.ai), so
 // a stable per-room id is the right unit — every turn in a room shares the system
 // prompt and history prefix. It carries no PII (the room id is opaque) and is hashed
 // to keep it compact and non-identifying.
 func (b *Bot) convID(roomID string) string {
 	if !b.cfg.GrokPromptCache {
 		return ""
 	}
 	return fmt.Sprintf("vojo-%08x", hashString(roomID))
 }
 // computeUSD prices a call from the API-returned token usage (authoritative
 // counts) and the per-model price table — so the hard ceiling tracks real usage
 // even if the model/price changes (only the price table needs updating), and a
 // call books at the price of the model that actually served it.
 func computeUSD(model string, u Usage, cfg *Config) float64 {
 	p := cfg.priceFor(model)
 	nonCached := u.PromptTokens - u.CachedTokens
 	if nonCached < 0 {
 		nonCached = 0
 	}
-	return float64(nonCached)/1e6*cfg.PriceInputPerM +
+	return float64(nonCached)/1e6*p.InputPerM +
-		float64(cached)/1e6*cfg.PriceCachedPerM +
+		float64(u.CachedTokens)/1e6*p.CachedPerM +
-		float64(u.CompletionTokens)/1e6*cfg.PriceOutputPerM
+		float64(u.CompletionTokens)/1e6*p.OutputPerM
 }
 // react adds an emoji m.reaction to the triggering event — the bot's language-free
@ -415,27 +584,31 @@ func (b *Bot) react(ctx context.Context, roomID, eventID, emoji string) {
 // default, so this is a near-dead safety path; the reaction is far less intrusive
 // than the old text notice, but the once-gate keeps it from annotating every message
 // in the rare encrypted room.
-func (b *Bot) reactEncryptedOnce(ctx context.Context, roomID, eventID string) {
+// reactEncryptedOnce returns whether it reacted this call (true only the first time
 // for a room), so the caller can log the skip exactly once too.
 func (b *Bot) reactEncryptedOnce(ctx context.Context, roomID, eventID string) bool {
 	warned, err := b.st.HasWarnedEncrypted(roomID)
 	if err != nil {
 		b.log.Error("warned-flag read failed", "room", roomID, "err", err)
-		return
+		return false
 	}
 	if warned {
-		return
+		return false
 	}
 	b.react(ctx, roomID, eventID, reactEncrypted)
 	if err := b.st.SetWarnedEncrypted(roomID); err != nil {
 		b.log.Error("persist warned-flag failed", "room", roomID, "err", err)
 	}
 	return true
 }
 // sendReply sends the model's actual answer and records the completed exchange in the
-// conversation buffer so the next turn has context.
+// conversation buffer so the next turn has context. It RETURNS the send error so the
-func (b *Bot) sendReply(ctx context.Context, roomID string, trigger *Event, triggerMC *MessageContent, body string) {
+// caller can handle paid silence (§8.1): a billed answer that failed to deliver must
-	id := b.sendMessage(ctx, roomID, trigger, triggerMC, body)
+// refund the slot and react, not vanish.
-	if id == "" {
+func (b *Bot) sendReply(ctx context.Context, roomID string, trigger *Event, triggerMC *MessageContent, body string) error {
-		return
+	if err := b.sendMessage(ctx, roomID, trigger, triggerMC, body); err != nil {
 		return err
 	}
 	// Record the user trigger AND the assistant answer together, only AFTER the answer
 	// was sent, so a failed or empty generation never leaves a dangling user turn (a
@ -443,20 +616,21 @@ func (b *Bot) sendReply(ctx context.Context, roomID string, trigger *Event, trig
 	// Single-flight guarantees no other turn for this room interleaves between the two.
 	b.appendBuf(roomID, bufferedMsg{sender: trigger.Sender, body: triggerMC.Body, isBot: false})
 	b.appendBuf(roomID, bufferedMsg{sender: b.cfg.BotMXID, body: body, isBot: true})
 	return nil
 }
-// sendMessage builds and sends an m.notice reply, tracks our own event id, and returns
+// sendMessage builds and sends an m.notice reply and tracks our own event id. Returns
-// the new event id ("" on failure).
+// the send error (nil on success) so the caller can detect a failed delivery.
-func (b *Bot) sendMessage(ctx context.Context, roomID string, trigger *Event, triggerMC *MessageContent, body string) string {
+func (b *Bot) sendMessage(ctx context.Context, roomID string, trigger *Event, triggerMC *MessageContent, body string) error {
 	content := buildNoticeContent(trigger.EventID, trigger.Sender, triggerMC.RelatesTo, body)
 	id, err := b.mx.SendEvent(ctx, roomID, "m.room.message", content)
 	if err != nil {
 		b.log.Error("send failed", "room", roomID, "err", err)
-		return ""
+		return err
 	}
 	// Track our own reply so a future reply-to-it is recognised as addressing us.
 	b.botSent.Add(id)
-	return id
+	return nil
 }
 // startTypingKeepalive starts the typing indicator and keeps it alive for the whole
--- a/apps/ai-bot/bot_test.go
+++ b/apps/ai-bot/bot_test.go
@ -71,17 +71,22 @@ func TestStripReplyFallback(t *testing.T) {
 }
 func TestComputeUSD(t *testing.T) {
-	cfg := &Config{PriceInputPerM: 1.25, PriceCachedPerM: 0.20, PriceOutputPerM: 2.50}
+	const model = "grok-test"
-	var u xaiUsage
+	cfg := &Config{XAIModel: model, Prices: map[string]ModelPrice{
-	u.PromptTokens = 1_000_000
+		model: {InputPerM: 1.25, CachedPerM: 0.20, OutputPerM: 2.50},
-	u.PromptTokensDetails.CachedTokens = 400_000
+	}}
-	u.CompletionTokens = 1_000_000
+	u := Usage{PromptTokens: 1_000_000, CachedTokens: 400_000, CompletionTokens: 1_000_000}
 	// nonCached 600k*1.25 + cached 400k*0.20 + out 1M*2.50 = 0.75 + 0.08 + 2.50
-	got := computeUSD(u, cfg)
+	got := computeUSD(model, u, cfg)
 	want := 0.75 + 0.08 + 2.50
 	if diff := got - want; diff > 1e-9 || diff < -1e-9 {
 		t.Fatalf("computeUSD = %v, want %v", got, want)
 	}
 	// An unknown model falls back to the default model's price (never $0, which would
 	// blind the ceiling).
 	if got := computeUSD("unknown-model", u, cfg); got != want {
 		t.Fatalf("unknown-model fallback = %v, want default %v", got, want)
 	}
 }
 func TestBuildContextGroupDropsThirdParties(t *testing.T) {
--- a/apps/ai-bot/cascade.go
+++ b/apps/ai-bot/cascade.go
@ -0,0 +1,279 @@
 package main
 import (
 	"context"
 	"errors"
 	"fmt"
 	"strings"
 	"time"
 )
 // cascade.go is the generation half of the bot: given an admitted request, it routes
 // (router.go), runs the chosen route's provider(s), and ALWAYS degrades to grok_direct
 // on any layer being off or failing (§8.2). It returns a genResult the business logic
 // (respond) settles, sends, and logs — keeping ledger/never-silent/telemetry in one
 // place and the routing here. With every cascade flag off, classify returns grok_direct
 // and this collapses to exactly today's single Grok call.
 // genResult is everything respond needs from a generation: the answer, the model's
 // usage (for token billing), the FULL cost breakdown (router + web + final), and the
 // routing metadata for telemetry. cost accumulates across stages, so a partial cascade
 // (a paid web fetch that then degraded) still books what it actually spent.
 type genResult struct {
 	text       string
 	usage      Usage
 	cost       CostBreakdown
 	finalModel string
 	providerID string
 	decision   RouterDecision
 	route      string // the route actually taken (may differ from decision on degrade)
 	fallback   bool   // true if we degraded off the decided route
 	degraded   string // degrade reason for request_log
 	stageMS    map[string]int
 }
 func msSince(t time.Time) int { return int(time.Since(t).Milliseconds()) }
 // reserveEstimate is the admission envelope: the most expensive ENABLED route's cost,
 // so whichever route the router picks is covered by the reservation (the ceiling can't
 // be slipped by routing to a pricier path after admission). With every cascade flag
 // off it equals grok_direct's estimate — byte-for-byte today's reservation. Slightly
 // generous is fine: Settle books the authoritative actual afterward.
 func (b *Bot) reserveEstimate() float64 {
 	est := b.estimateUSD(b.cfg.XAIModel) // grok_direct / trivial(cheaper)/synthesis base
 	if b.cfg.WebEnabled {
 		// web_then_grok = a web fetch fee + the Grok synthesis already counted above.
 		if b.cfg.WebProvider == webProviderGrokWebSearch {
 			// fetch can search several times and pull large context; reserve generously.
 			est += float64(maxWebSearchCalls)*grokWebSearchPerCall + b.estimateUSD(b.cfg.XAIModel)
 		} else {
 			est += b.estimateUSD(b.cfg.GeminiModel)
 		}
 	}
 	if b.cfg.ReasoningEnabled {
 		// Higher reasoning effort can burn more output tokens; reserve double.
 		est = max(est, 2*b.estimateUSD(b.cfg.ReasoningModel))
 	}
 	return est
 }
 // generate routes and produces an answer, degrading to grok_direct on any failure.
 // It returns a terminal error ONLY if even grok_direct fails; every other route falls
 // through to grok_direct rather than erroring.
 func (b *Bot) generate(ctx context.Context, body string, msgs []Message, convID string) (genResult, error) {
 	res := genResult{stageMS: map[string]int{}, finalModel: b.cfg.XAIModel}
 	t0 := time.Now()
 	res.decision = b.classify(ctx, body, &res.cost) // accumulates cost.Router if Layer-1 runs
 	res.stageMS["router"] = msSince(t0)
 	res.route = res.decision.Route
 	finalMsgs := msgs
 	switch res.decision.Route {
 	case routeTrivial:
 		if b.cfg.TrivialOffloadEnabled && b.gemini != nil {
 			if err := b.genTrivial(ctx, msgs, &res); err == nil {
 				return res, nil
 			} else {
 				b.log.Warn("trivial offload failed; degrading to grok_direct", "err", err)
 				b.degradeTo(&res, degradeTrivial)
 			}
 		}
 	case routeWebThenGrok:
 		if b.cfg.WebEnabled && b.web != nil {
 			if err := b.genWebThenGrok(ctx, body, msgs, convID, &res); err == nil {
 				return res, nil
 			} else {
 				b.log.Warn("web route failed; degrading to grok_direct", "err", err, "reason", res.degraded)
 				b.degradeTo(&res, degradeWeb)
 				// The question wanted fresh facts but we have none — answer from training
 				// knowledge WITH an honest staleness caveat, not stale-as-current (§8.2.1).
 				finalMsgs = hedgeMessages(msgs)
 			}
 		}
 	case routeReason:
 		if b.cfg.ReasoningEnabled {
 			if err := b.genReason(ctx, msgs, convID, &res); err == nil {
 				return res, nil
 			} else {
 				b.log.Warn("reasoning route failed; degrading to grok_direct", "err", err)
 				b.degradeTo(&res, degradeReasoning)
 			}
 		}
 	}
 	// grok_direct — the default route AND the universal fallback. The only path that
 	// can return a terminal error (even Grok failed). It preserves any cost already
 	// spent (router classifier, a partial web fetch) in res.cost.
 	if err := b.genGrokDirect(ctx, finalMsgs, convID, &res); err != nil {
 		return res, err
 	}
 	return res, nil
 }
 // degradeTo marks res as a fallback to grok_direct, keeping the first/most-specific
 // degrade reason (e.g. a web provider's grounding_cap set inside genWebThenGrok).
 func (b *Bot) degradeTo(res *genResult, reason string) {
 	res.fallback = true
 	if res.degraded == "" {
 		res.degraded = reason
 	}
 }
 // genGrokDirect is today's path: one Grok call. Also the fallback for every other
 // route. On success it fills res (route, final model, text, usage, provider id) and
 // adds the token cost.
 func (b *Bot) genGrokDirect(ctx context.Context, msgs []Message, convID string, res *genResult) error {
 	t := time.Now()
 	resp, err := b.llm.Complete(ctx, LLMRequest{
 		Model:           b.cfg.XAIModel,
 		Messages:        msgs,
 		MaxTokens:       b.cfg.MaxOutTok,
 		Temperature:     b.cfg.XAITemp,
 		ConvID:          convID,
 		ReasoningEffort: b.cfg.GrokReasoningEffort, // "" → not sent; "none" keeps grok-4.3 fast
 	})
 	res.stageMS["final"] = msSince(t)
 	if err != nil {
 		return err
 	}
 	res.route, res.finalModel = routeGrokDirect, b.cfg.XAIModel
 	res.text, res.usage, res.providerID = resp.Text, resp.Usage, resp.ProviderRequestID
 	res.cost.Token += computeUSD(b.cfg.XAIModel, resp.Usage, b.cfg)
 	return nil
 }
 // genTrivial answers a trivial message with the cheap Gemini model. An empty reply is
 // treated as a failure so the caller degrades to Grok rather than sending nothing.
 func (b *Bot) genTrivial(ctx context.Context, msgs []Message, res *genResult) error {
 	t := time.Now()
 	resp, err := b.gemini.Complete(ctx, LLMRequest{
 		Model:       b.cfg.GeminiModel,
 		Messages:    msgs,
 		MaxTokens:   b.cfg.MaxOutTok,
 		Temperature: b.cfg.XAITemp,
 	})
 	res.stageMS["final"] = msSince(t)
 	if err != nil {
 		return err
 	}
 	if strings.TrimSpace(resp.Text) == "" {
 		return fmt.Errorf("trivial: empty Gemini reply")
 	}
 	res.route, res.finalModel = routeTrivial, b.cfg.GeminiModel
 	res.text, res.usage, res.providerID = resp.Text, resp.Usage, resp.ProviderRequestID
 	res.cost.Token += computeUSD(b.cfg.GeminiModel, resp.Usage, b.cfg)
 	return nil
 }
 // genReason answers with Grok at a higher reasoning effort. Uses the configured
 // reasoning-capable model (the default grok-4.20-non-reasoning would reject the param).
 func (b *Bot) genReason(ctx context.Context, msgs []Message, convID string, res *genResult) error {
 	t := time.Now()
 	resp, err := b.llm.Complete(ctx, LLMRequest{
 		Model:           b.cfg.ReasoningModel,
 		Messages:        msgs,
 		MaxTokens:       b.cfg.MaxOutTok,
 		Temperature:     b.cfg.XAITemp,
 		ReasoningEffort: b.cfg.ReasoningEffort, // "think harder" level (default high)
 		ConvID:          convID,
 	})
 	res.stageMS["final"] = msSince(t)
 	if err != nil {
 		return err
 	}
 	if strings.TrimSpace(resp.Text) == "" {
 		return fmt.Errorf("reason: empty reply")
 	}
 	res.route, res.finalModel = routeReason, b.cfg.ReasoningModel
 	res.text, res.usage, res.providerID = resp.Text, resp.Usage, resp.ProviderRequestID
 	res.cost.Token += computeUSD(b.cfg.ReasoningModel, resp.Usage, b.cfg)
 	return nil
 }
 // webStageTimeout bounds the web/grounding fetch independently of the overall budget
 // (§8.2.2): a slow search must not eat the whole request before synthesis.
 const webStageTimeout = 15 * time.Second
 // genWebThenGrok fetches fresh facts via the web provider, then has Grok synthesise the
 // answer in voice from that digest. The web fetch's cost+tokens are booked into res
 // EVEN ON FAILURE — the call was billed — so a synth failure or empty fetch still
 // accounts for the spend before the caller degrades to grok_direct (the partial cascade
 // case, §8.1). The daily cap and per-stage deadline are applied here, uniformly for both
 // providers.
 func (b *Bot) genWebThenGrok(ctx context.Context, body string, msgs []Message, convID string, res *genResult) error {
 	// Per-stage web/grounding deadline, independent of the overall budget.
 	wctx, cancelW := context.WithTimeout(ctx, webStageTimeout)
 	tw := time.Now()
 	wc, ferr := b.web.Fetch(wctx, body)
 	cancelW()
 	res.stageMS["web"] = msSince(tw)
 	// Book the fetch's fee + tokens whether or not it produced a usable digest — the call
 	// was billed (the daily cap, if any, is enforced inside the provider).
 	res.cost.Grounding += wc.Cost.Grounding
 	res.cost.WebTool += wc.Cost.WebTool
 	webUsage := wc.Usage
 	if ferr != nil {
 		if errors.Is(ferr, errGroundingCapped) {
 			res.degraded = degradeGroundCap
 		}
 		return ferr // web fee already booked; caller degrades to grok_direct (with hedge)
 	}
 	tf := time.Now()
 	resp, err := b.llm.Complete(ctx, LLMRequest{
 		Model:           b.cfg.XAIModel,
 		Messages:        webSynthMessages(msgs, wc),
 		MaxTokens:       b.cfg.MaxOutTok,
 		Temperature:     b.cfg.XAITemp,
 		ConvID:          convID,
 		ReasoningEffort: b.cfg.GrokReasoningEffort, // same voice, same effort as grok_direct
 	})
 	res.stageMS["final"] = msSince(tf)
 	if err != nil {
 		return err
 	}
 	if strings.TrimSpace(resp.Text) == "" {
 		return fmt.Errorf("web synth: empty reply")
 	}
 	res.route, res.finalModel = routeWebThenGrok, b.cfg.XAIModel
 	res.text, res.providerID = resp.Text, resp.ProviderRequestID
 	// Report BOTH calls' tokens so the analytics token totals match the two-call route.
 	res.usage = Usage{
 		PromptTokens:     resp.Usage.PromptTokens + webUsage.PromptTokens,
 		CachedTokens:     resp.Usage.CachedTokens + webUsage.CachedTokens,
 		CompletionTokens: resp.Usage.CompletionTokens + webUsage.CompletionTokens,
 	}
 	res.cost.Token += computeUSD(b.cfg.XAIModel, resp.Usage, b.cfg)
 	return nil
 }
 // webSynthMessages inserts the fresh web digest (and its sources) as a system note just
 // after the system prompt, so Grok answers in voice using current facts.
 func webSynthMessages(base []Message, wc WebContext) []Message {
 	facts := "Свежие данные из веба (используй их в ответе и сошлись на источники):\n" + wc.Digest
 	if len(wc.Citations) > 0 {
 		facts += "\nИсточники: " + strings.Join(wc.Citations, ", ")
 	}
 	return insertSystemNote(base, facts)
 }
 // hedgeMessages adds an honest staleness caveat for a web→grok_direct degrade: the user
 // wanted fresh facts but we couldn't fetch them, so the model must flag that its answer
 // is from training knowledge and may be out of date.
 func hedgeMessages(base []Message) []Message {
 	return insertSystemNote(base, "Нет доступа к свежим источникам прямо сейчас — отвечай по знаниям на момент обучения и честно предупреди, что данные могут быть устаревшими.")
 }
 // insertSystemNote inserts an extra system message right after the system prompt
 // (base[0] from buildContext), preserving the rest of the window.
 func insertSystemNote(base []Message, content string) []Message {
 	note := Message{Role: "system", Content: content}
 	if len(base) == 0 {
 		return []Message{note}
 	}
 	out := make([]Message, 0, len(base)+1)
 	out = append(out, base[0], note)
 	out = append(out, base[1:]...)
 	return out
 }
--- a/apps/ai-bot/cascade_test.go
+++ b/apps/ai-bot/cascade_test.go
@ -0,0 +1,275 @@
 package main
 import (
 	"context"
 	"errors"
 	"io"
 	"log/slog"
 	"testing"
 )
 func discardLog() *slog.Logger { return slog.New(slog.NewTextHandler(io.Discard, nil)) }
 // fakeLLM is a scriptable LLMClient for dispatch/degrade tests.
 type fakeLLM struct {
 	text    string
 	usage   Usage
 	err     error
 	calls   int
 	lastReq LLMRequest
 }
 func (f *fakeLLM) Complete(_ context.Context, req LLMRequest) (*LLMResponse, error) {
 	f.calls++
 	f.lastReq = req
 	if f.err != nil {
 		return nil, f.err
 	}
 	return &LLMResponse{Text: f.text, Usage: f.usage, ProviderRequestID: "fake"}, nil
 }
 type fakeWeb struct {
 	wc    WebContext
 	err   error
 	calls int
 }
 func (f *fakeWeb) Fetch(_ context.Context, _ string) (WebContext, error) {
 	f.calls++
 	if f.err != nil {
 		return WebContext{}, f.err
 	}
 	return f.wc, nil
 }
 // cascadeCfg is a config with the model/price table set and EVERY cascade flag off.
 // Tests flip individual flags on a copy.
 func cascadeCfg() Config {
 	return Config{
 		XAIModel: "grok-x", GeminiModel: "gemini-x", ReasoningModel: "grok-reason",
 		MaxOutTok: 100, XAITemp: 0.5,
 		ReasoningTrigger: "подумай глубже",
 		ReasoningEffort:  "high",
 		WebProvider:      webProviderGrokWebSearch,
 		Prices: map[string]ModelPrice{
 			"grok-x":   {InputPerM: 1, CachedPerM: 0.2, OutputPerM: 2},
 			"gemini-x": {InputPerM: 0.1, CachedPerM: 0.1, OutputPerM: 0.4},
 		},
 	}
 }
 func msgs(body string) []Message {
 	return []Message{{Role: "system", Content: "SYS"}, {Role: "user", Content: body}}
 }
 // TestGenerateAllFlagsOffIsGrokDirect is the cascade-off parity invariant: even a
 // "trivial"-looking message goes to Grok, and Gemini is never touched, when the router
 // is off.
 func TestGenerateAllFlagsOffIsGrokDirect(t *testing.T) {
 	grok := &fakeLLM{text: "grok answer"}
 	gem := &fakeLLM{text: "should not run"}
 	cfg := cascadeCfg()
 	b := &Bot{cfg: &cfg, llm: grok, gemini: gem, log: discardLog()}
 	res, err := b.generate(context.Background(), "привет", msgs("привет"), "")
 	if err != nil {
 		t.Fatalf("generate: %v", err)
 	}
 	if res.route != routeGrokDirect || res.text != "grok answer" {
 		t.Fatalf("res = (%q,%q), want grok_direct/\"grok answer\"", res.route, res.text)
 	}
 	if res.decision.Source != "default" {
 		t.Fatalf("router source = %q, want default (router off)", res.decision.Source)
 	}
 	if grok.calls != 1 || gem.calls != 0 {
 		t.Fatalf("calls grok=%d gem=%d, want 1/0", grok.calls, gem.calls)
 	}
 }
 func TestGenerateTrivialOffload(t *testing.T) {
 	grok := &fakeLLM{text: "grok"}
 	gem := &fakeLLM{text: "gemini trivial"}
 	cfg := cascadeCfg()
 	cfg.RouterEnabled, cfg.TrivialOffloadEnabled = true, true
 	b := &Bot{cfg: &cfg, llm: grok, gemini: gem, log: discardLog()}
 	res, err := b.generate(context.Background(), "привет", msgs("привет"), "")
 	if err != nil {
 		t.Fatalf("generate: %v", err)
 	}
 	if res.route != routeTrivial || res.text != "gemini trivial" || res.finalModel != "gemini-x" {
 		t.Fatalf("res = (%q,%q,%q), want trivial/gemini", res.route, res.text, res.finalModel)
 	}
 	if gem.calls != 1 || grok.calls != 0 {
 		t.Fatalf("calls grok=%d gem=%d, want 0/1 (Gemini answered)", grok.calls, gem.calls)
 	}
 }
 // TestGenerateTrivialDegradesToGrok: Gemini failing on the trivial route must fall back
 // to Grok, never go silent.
 func TestGenerateTrivialDegradesToGrok(t *testing.T) {
 	grok := &fakeLLM{text: "grok fallback"}
 	gem := &fakeLLM{err: errors.New("gemini down")}
 	cfg := cascadeCfg()
 	cfg.RouterEnabled, cfg.TrivialOffloadEnabled = true, true
 	b := &Bot{cfg: &cfg, llm: grok, gemini: gem, log: discardLog()}
 	res, err := b.generate(context.Background(), "привет", msgs("привет"), "")
 	if err != nil {
 		t.Fatalf("generate: %v", err)
 	}
 	if res.route != routeGrokDirect || res.text != "grok fallback" {
 		t.Fatalf("res = (%q,%q), want grok_direct fallback", res.route, res.text)
 	}
 	if !res.fallback || res.degraded != degradeTrivial {
 		t.Fatalf("fallback=%v degraded=%q, want true/trivial_failed", res.fallback, res.degraded)
 	}
 	if gem.calls != 1 || grok.calls != 1 {
 		t.Fatalf("calls grok=%d gem=%d, want 1/1", grok.calls, gem.calls)
 	}
 }
 func TestGenerateWebThenGrok(t *testing.T) {
 	grok := &fakeLLM{text: "synthesised", usage: Usage{PromptTokens: 100, CompletionTokens: 50}}
 	web := &fakeWeb{wc: WebContext{Digest: "fresh facts", Citations: []string{"http://src"}, Cost: CostBreakdown{WebTool: 0.1}}}
 	cfg := cascadeCfg()
 	cfg.RouterEnabled, cfg.WebEnabled = true, true
 	b := &Bot{cfg: &cfg, llm: grok, web: web, log: discardLog()}
 	res, err := b.generate(context.Background(), "какие новости сегодня", msgs("какие новости сегодня"), "")
 	if err != nil {
 		t.Fatalf("generate: %v", err)
 	}
 	if res.route != routeWebThenGrok || res.text != "synthesised" {
 		t.Fatalf("res = (%q,%q), want web_then_grok/synthesised", res.route, res.text)
 	}
 	if res.cost.WebTool != 0.1 || res.cost.Token <= 0 {
 		t.Fatalf("cost = %+v, want WebTool 0.1 + Token>0", res.cost)
 	}
 	if web.calls != 1 || grok.calls != 1 {
 		t.Fatalf("calls web=%d grok=%d, want 1/1", web.calls, grok.calls)
 	}
 }
 // TestGenerateWebDegradesToGrok: a web fetch failure (provider down or cap hit) degrades
 // to grok_direct and books no web cost.
 func TestGenerateWebDegradesToGrok(t *testing.T) {
 	grok := &fakeLLM{text: "grok fallback"}
 	web := &fakeWeb{err: errGroundingCapped}
 	cfg := cascadeCfg()
 	cfg.RouterEnabled, cfg.WebEnabled = true, true
 	b := &Bot{cfg: &cfg, llm: grok, web: web, log: discardLog()}
 	res, err := b.generate(context.Background(), "новости сегодня", msgs("новости сегодня"), "")
 	if err != nil {
 		t.Fatalf("generate: %v", err)
 	}
 	if res.route != routeGrokDirect || res.text != "grok fallback" || !res.fallback {
 		t.Fatalf("res = (%q,%q,fallback=%v), want grok_direct fallback", res.route, res.text, res.fallback)
 	}
 	if res.degraded != degradeGroundCap {
 		t.Fatalf("degraded = %q, want grounding_cap (the specific reason)", res.degraded)
 	}
 	if res.cost.WebTool != 0 || res.cost.Grounding != 0 {
 		t.Fatalf("web cost = %+v, want 0 (fetch failed before billing)", res.cost)
 	}
 }
 // TestGenerateReasoningForced: the manual trigger routes to the reasoning model with
 // reasoning_effort, independent of ROUTER_ENABLED.
 func TestGenerateReasoningForced(t *testing.T) {
 	grok := &fakeLLM{text: "deep answer"}
 	cfg := cascadeCfg()
 	cfg.ReasoningEnabled = true // ROUTER_ENABLED deliberately left off
 	b := &Bot{cfg: &cfg, llm: grok, log: discardLog()}
 	res, err := b.generate(context.Background(), "подумай глубже про сознание", msgs("подумай глубже про сознание"), "")
 	if err != nil {
 		t.Fatalf("generate: %v", err)
 	}
 	if res.route != routeReason || res.decision.Source != "forced" {
 		t.Fatalf("res route=%q source=%q, want reason/forced", res.route, res.decision.Source)
 	}
 	if grok.lastReq.ReasoningEffort != "high" || grok.lastReq.Model != "grok-reason" {
 		t.Fatalf("reasoning req = (effort %q, model %q), want high/grok-reason", grok.lastReq.ReasoningEffort, grok.lastReq.Model)
 	}
 }
 // TestClassifierConfidenceFloor: a Layer-1 classifier label that escalates off the safe
 // floor (trivial/web) must clear the confidence floor, else the request stays on
 // grok_direct — the false-trivial voice-leak guard (§8.6).
 func TestClassifierConfidenceFloor(t *testing.T) {
 	cfg := cascadeCfg()
 	cfg.RouterEnabled, cfg.RouterClassifierEnabled = true, true
 	gem := &fakeLLM{}
 	b := &Bot{cfg: &cfg, gemini: gem, log: discardLog()}
 	var cost CostBreakdown
 	const substantive = "напиши подробное эссе про историю римской империи" // Layer-0 → grok_direct
 	gem.text = `{"route":"trivial","confidence":0.2}` // low-confidence escalation
 	if d := b.classify(context.Background(), substantive, &cost); d.Route != routeGrokDirect {
 		t.Fatalf("low-confidence trivial must stay grok_direct (safe floor), got %q", d.Route)
 	}
 	gem.text = `{"route":"trivial","confidence":0.95}` // confident escalation is honoured
 	if d := b.classify(context.Background(), substantive, &cost); d.Route != routeTrivial {
 		t.Fatalf("high-confidence trivial should route trivial, got %q", d.Route)
 	}
 	// A classifier error degrades to the Layer-0 verdict (grok_direct), never silence.
 	gem.text, gem.err = "", errors.New("gemini down")
 	if d := b.classify(context.Background(), substantive, &cost); d.Route != routeGrokDirect {
 		t.Fatalf("classifier failure must fall back to heuristic grok_direct, got %q", d.Route)
 	}
 }
 // TestGrokReasoningEffort: GROK_REASONING_EFFORT is sent on grok_direct (so grok-4.3 can
 // be kept fast with "none"), empty means not sent (compat with grok-4.20-non-reasoning),
 // and the reason route always overrides to "high" regardless.
 func TestGrokReasoningEffort(t *testing.T) {
 	// Configured effort reaches grok_direct.
 	grok := &fakeLLM{text: "ok"}
 	cfg := cascadeCfg()
 	cfg.GrokReasoningEffort = "none"
 	b := &Bot{cfg: &cfg, llm: grok, log: discardLog()}
 	if _, err := b.generate(context.Background(), "hello", msgs("hello"), ""); err != nil {
 		t.Fatal(err)
 	}
 	if grok.lastReq.ReasoningEffort != "none" {
 		t.Fatalf("grok_direct effort = %q, want none", grok.lastReq.ReasoningEffort)
 	}
 	// Empty default → not sent (so grok-4.20-non-reasoning keeps working).
 	grokDef := &fakeLLM{text: "ok"}
 	cfgDef := cascadeCfg() // GrokReasoningEffort == ""
 	bDef := &Bot{cfg: &cfgDef, llm: grokDef, log: discardLog()}
 	if _, err := bDef.generate(context.Background(), "hello", msgs("hello"), ""); err != nil {
 		t.Fatal(err)
 	}
 	if grokDef.lastReq.ReasoningEffort != "" {
 		t.Fatalf("default effort = %q, want empty (not sent)", grokDef.lastReq.ReasoningEffort)
 	}
 	// The reason route ignores GROK_REASONING_EFFORT and always uses "high".
 	grokR := &fakeLLM{text: "deep"}
 	cfgR := cascadeCfg()
 	cfgR.GrokReasoningEffort = "none"
 	cfgR.ReasoningEnabled = true
 	bR := &Bot{cfg: &cfgR, llm: grokR, log: discardLog()}
 	if _, err := bR.generate(context.Background(), "подумай глубже про X", msgs("подумай глубже про X"), ""); err != nil {
 		t.Fatal(err)
 	}
 	if grokR.lastReq.ReasoningEffort != "high" {
 		t.Fatalf("reason route effort = %q, want high (overrides GROK_REASONING_EFFORT)", grokR.lastReq.ReasoningEffort)
 	}
 }
 // TestGenerateTerminalErrorPropagates: if even grok_direct fails, generate returns the
 // error (respond turns it into refund + react), not a silent empty success.
 func TestGenerateTerminalErrorPropagates(t *testing.T) {
 	grok := &fakeLLM{err: errors.New("xai down")}
 	cfg := cascadeCfg()
 	b := &Bot{cfg: &cfg, llm: grok, log: discardLog()}
 	if _, err := b.generate(context.Background(), "hello", msgs("hello"), ""); err == nil {
 		t.Fatal("want terminal error when grok_direct fails, got nil")
 	}
 }
--- a/apps/ai-bot/config.go
+++ b/apps/ai-bot/config.go
@ -5,6 +5,7 @@ import (
 	"os"
 	"strconv"
 	"strings"
 	"time"
 )
 // Config is the fully-resolved runtime configuration, parsed once from the
@ -36,23 +37,106 @@ type Config struct {
 	MaxOutTok   int
 	MaxCtxEvent int
 	// GrokReasoningEffort is the reasoning_effort sent on the normal Grok voice calls
 	// (grok_direct + web synthesis). Empty = don't send it (the default — required for
 	// grok-4.20-non-reasoning, which rejects the param). On a unified model like
 	// grok-4.3 the API otherwise defaults to "low" (it thinks on every reply); set this
 	// to "none" to keep the default voice fast/cheap. The reason_then_grok route ignores
 	// this and always uses "high". Accepted: "" | none | low | medium | high.
 	GrokReasoningEffort string
 	// Allowlist of homeservers whose users may pull the bot into a room. Gates
 	// the *inviter* (F11). Comma-separated env, stored as a set.
 	AllowedServers map[string]bool
 	DailyUSDCeiling float64
 	PerUserDailyCap int
 	// PerUserDailyUSD is an optional per-user daily $ quota (0 = off) on top of the
 	// request count cap, so one user on expensive routes can't drain the shared global
 	// ceiling and deny everyone else. Checked against the user's own committed+reserved
 	// spend in Reserve.
 	PerUserDailyUSD float64
 	// mxids exempt from PER_USER_DAILY_CAP (e.g. the owner/admins testing). Still
 	// subject to the global DAILY_USD_CEILING, so the wallet stays protected.
 	UnlimitedUsers map[string]bool
-	// USD-per-1M-token prices applied to the API-returned token usage so the
+	// USD-per-1M-token prices for the default (final-voice) model, applied to the
-	// hard ceiling tracks real usage even if the model/price changes.
+	// API-returned token usage so the hard ceiling tracks real usage even if the
 	// model/price changes. Kept as the back-compat XAI_PRICE_* source; folded into
 	// Prices below.
 	PriceInputPerM  float64
 	PriceCachedPerM float64
 	PriceOutputPerM float64
 	// Prices is the per-model price table (LiteLLM pattern) read by priceFor(model),
 	// so a call books at the price of the model that actually served it. Built in
 	// LoadConfig; the default model's entry comes from the XAI_PRICE_* envs, and a
 	// second model (Gemini) adds its own entry when that layer lands.
 	Prices map[string]ModelPrice
 	// RequestBudget bounds one whole request (all model calls share it), so a slow or
 	// retried call — or a multi-stage cascade — can't accrete minutes. The default
 	// matches the previous effective ceiling for a single grok_direct call.
 	RequestBudget time.Duration
 	// GrokPromptCache, when true, sends the x-grok-conv-id routing header to raise the
 	// prompt-cache hit rate (Grok caches automatically; the header only pins routing).
 	GrokPromptCache bool
 	// TelemetryEnabled writes the request_log analytics row for every request. Default
 	// off so the cascade-off path adds no extra write; turned on to measure the base.
 	// Its write is isolated — a failure logs a WARN, never drops the answer.
 	TelemetryEnabled bool
 	// TelemetryStoreText additionally stores the query text in request_log (for offline
 	// eval). Default off — only metadata is kept.
 	TelemetryStoreText bool
 	// TelemetryRetention trims request_log rows older than this (time-based, since the
 	// analytics are a time series). 0 disables trimming.
 	TelemetryRetention time.Duration
 	// --- Cascade (Phase 2-4). EVERY flag defaults OFF, so an unset environment is
 	// exactly today's bot: one grok_direct call. Any layer off or failing degrades to
 	// grok_direct (§8.2). None of these is enabled in prod until the offline-eval gate
 	// (§9) passes. ---
 	// RouterEnabled turns on the Layer-0 heuristic router; off → everything is
 	// grok_direct. RouterClassifierEnabled additionally consults the Gemini Layer-1
 	// classifier on cases the heuristic left as grok_direct.
 	RouterEnabled           bool
 	RouterClassifierEnabled bool
 	// TrivialOffloadEnabled lets the trivial route answer with Gemini; off → trivial
 	// still goes to Grok.
 	TrivialOffloadEnabled bool
 	// WebEnabled turns on the web_then_grok route. WebProvider selects the source:
 	// grok_web_search (default, works on chat/completions via Live Search) or
 	// gemini_grounding (Gemini-3 native only — see F-EXT-3).
 	WebEnabled  bool
 	WebProvider string
 	// WebGroundingDailyCap caps grounded prompts/day (durable counter) before falling
 	// back, guarding the $/1k grounding overage. WebGroundingTier records the Gemini
 	// plan the cap reflects.
 	WebGroundingDailyCap int
 	WebGroundingTier     string
 	// Reasoning route: a manual "think harder" trigger. ReasoningModel must be a
 	// reasoning-capable model (the default grok-4.20-non-reasoning is NOT — see the
 	// docs.x.ai finding); set REASONING_MODEL to e.g. grok-4.3 to use it.
 	ReasoningEnabled bool
 	ReasoningTrigger string
 	ReasoningModel   string
 	// ReasoningEffort is the reasoning_effort the reason_then_grok route sends on the
 	// manual "think harder" trigger. Default "high". Accepted: none|low|medium|high.
 	ReasoningEffort string
 	// CanaryPercent routes a fraction of traffic through the new path for A/B before a
 	// full enable. 0 = off (scaffold; not yet consulted by the dispatch).
 	CanaryPercent int
 	// Gemini backend (the cheap/router/grounding model). Required only when a layer
 	// that uses it is enabled (validated below).
 	GeminiBaseURL string
 	GeminiAPIKey  string
 	GeminiModel   string
 	SystemPromptPath string
 	SystemPrompt     string
 	StateDir         string
@ -111,6 +195,23 @@ func getenvFloat(key string, def float64) (float64, error) {
 	return f, nil
 }
 // getenvBool parses a boolean flag. Accepts the usual 1/0/true/false/yes/no/on/off
 // (case-insensitive); empty → default. Every cascade flag defaults false, so an unset
 // or blank env keeps today's behaviour.
 func getenvBool(key string, def bool) (bool, error) {
 	raw := strings.TrimSpace(getenv(key, ""))
 	if raw == "" {
 		return def, nil
 	}
 	switch strings.ToLower(raw) {
 	case "1", "true", "yes", "on":
 		return true, nil
 	case "0", "false", "no", "off":
 		return false, nil
 	}
 	return false, fmt.Errorf("%s must be a boolean (true/false), got %q", key, raw)
 }
 func parseServerSet(raw string) map[string]bool {
 	set := make(map[string]bool)
 	for _, s := range strings.Split(raw, ",") {
@ -139,6 +240,16 @@ func LoadConfig() (*Config, error) {
 		DatabaseURL:      getenv("AI_BOT_DATABASE_URL", ""),
 		AllowedServers:   parseServerSet(getenv("ALLOWED_SERVERS", "")),
 		UnlimitedUsers:   parseServerSet(getenv("UNLIMITED_USERS", "")),
 		// Cascade string-valued config (flags/ints/secrets parsed below).
 		GrokReasoningEffort: strings.ToLower(strings.TrimSpace(getenv("GROK_REASONING_EFFORT", ""))),
 		WebProvider:         getenv("WEB_PROVIDER", webProviderGrokWebSearch),
 		WebGroundingTier:    getenv("WEB_GROUNDING_TIER", "free"),
 		ReasoningTrigger:    getenv("REASONING_TRIGGER", "подумай глубже"),
 		ReasoningModel:      getenv("REASONING_MODEL", "grok-4.3"),
 		ReasoningEffort:     strings.ToLower(strings.TrimSpace(getenv("REASONING_EFFORT", "high"))),
 		GeminiBaseURL:       strings.TrimRight(getenv("GEMINI_BASE_URL", "https://generativelanguage.googleapis.com/v1beta/openai"), "/"),
 		GeminiModel:         getenv("GEMINI_MODEL", "gemini-2.5-flash-lite"),
 	}
 	var problems []string
@ -152,6 +263,7 @@ func LoadConfig() (*Config, error) {
 		{"AS_TOKEN", &cfg.ASToken},
 		{"HS_TOKEN", &cfg.HSToken},
 		{"XAI_API_KEY", &cfg.XAIAPIKey},
 		{"GEMINI_API_KEY", &cfg.GeminiAPIKey}, // optional; required only if a Gemini layer is on
 	} {
 		v, err := getSecret(s.key)
 		if err != nil {
@ -207,6 +319,9 @@ func LoadConfig() (*Config, error) {
 	if cfg.PerUserDailyCap, err = getenvInt("PER_USER_DAILY_CAP", 30); err != nil {
 		problems = append(problems, err.Error())
 	}
 	if cfg.PerUserDailyUSD, err = getenvFloat("PER_USER_DAILY_USD", 0); err != nil {
 		problems = append(problems, err.Error())
 	}
 	if cfg.PriceInputPerM, err = getenvFloat("XAI_PRICE_INPUT_PER_M", 1.25); err != nil {
 		problems = append(problems, err.Error())
 	}
@ -216,6 +331,110 @@ func LoadConfig() (*Config, error) {
 	if cfg.PriceOutputPerM, err = getenvFloat("XAI_PRICE_OUTPUT_PER_M", 2.50); err != nil {
 		problems = append(problems, err.Error())
 	}
 	// Per-model price table. The default (final-voice) model is priced from the
 	// XAI_PRICE_* envs; additional models register their own entry as their layer
 	// lands. priceFor falls back to this default model for an unknown model.
 	cfg.Prices = map[string]ModelPrice{
 		cfg.XAIModel: {
 			InputPerM:  cfg.PriceInputPerM,
 			CachedPerM: cfg.PriceCachedPerM,
 			OutputPerM: cfg.PriceOutputPerM,
 		},
 	}
 	var budgetSec, retentionDays int
 	if budgetSec, err = getenvInt("REQUEST_BUDGET_SECONDS", 180); err != nil {
 		problems = append(problems, err.Error())
 	}
 	cfg.RequestBudget = time.Duration(budgetSec) * time.Second
 	if cfg.GrokPromptCache, err = getenvBool("GROK_PROMPT_CACHE", false); err != nil {
 		problems = append(problems, err.Error())
 	}
 	if cfg.TelemetryEnabled, err = getenvBool("TELEMETRY_ENABLED", false); err != nil {
 		problems = append(problems, err.Error())
 	}
 	if cfg.TelemetryStoreText, err = getenvBool("TELEMETRY_STORE_TEXT", false); err != nil {
 		problems = append(problems, err.Error())
 	}
 	if retentionDays, err = getenvInt("TELEMETRY_RETENTION_DAYS", 30); err != nil {
 		problems = append(problems, err.Error())
 	}
 	cfg.TelemetryRetention = time.Duration(retentionDays) * 24 * time.Hour
 	// Cascade flags — every one defaults false, so an unset env is today's bot.
 	for _, f := range []struct {
 		key  string
 		dest *bool
 	}{
 		{"ROUTER_ENABLED", &cfg.RouterEnabled},
 		{"ROUTER_CLASSIFIER_ENABLED", &cfg.RouterClassifierEnabled},
 		{"TRIVIAL_OFFLOAD_ENABLED", &cfg.TrivialOffloadEnabled},
 		{"WEB_ENABLED", &cfg.WebEnabled},
 		{"REASONING_ENABLED", &cfg.ReasoningEnabled},
 	} {
 		if *f.dest, err = getenvBool(f.key, false); err != nil {
 			problems = append(problems, err.Error())
 		}
 	}
 	if cfg.WebGroundingDailyCap, err = getenvInt("WEB_GROUNDING_DAILY_CAP", 450); err != nil {
 		problems = append(problems, err.Error())
 	}
 	if cfg.CanaryPercent, err = getenvInt("CANARY_PERCENT", 0); err != nil {
 		problems = append(problems, err.Error())
 	}
 	// Gemini pricing → the per-model table (defaults: gemini-2.5-flash-lite $0.10/$0.40
 	// per 1M; cached priced as input, a conservative over-count). Lets the ceiling and
 	// request_log price Gemini calls at Gemini rates.
 	var gIn, gOut float64
 	if gIn, err = getenvFloat("GEMINI_PRICE_INPUT_PER_M", 0.10); err != nil {
 		problems = append(problems, err.Error())
 	}
 	if gOut, err = getenvFloat("GEMINI_PRICE_OUTPUT_PER_M", 0.40); err != nil {
 		problems = append(problems, err.Error())
 	}
 	cfg.Prices[cfg.GeminiModel] = ModelPrice{InputPerM: gIn, CachedPerM: gIn, OutputPerM: gOut}
 	// Reasoning model price (defaults to the final-voice grok rates — grok-4.3 ≈ 4.20),
 	// so the reasoning route reserves/bills at its own price instead of falling back.
 	var rIn, rOut float64
 	if rIn, err = getenvFloat("REASONING_PRICE_INPUT_PER_M", cfg.PriceInputPerM); err != nil {
 		problems = append(problems, err.Error())
 	}
 	if rOut, err = getenvFloat("REASONING_PRICE_OUTPUT_PER_M", cfg.PriceOutputPerM); err != nil {
 		problems = append(problems, err.Error())
 	}
 	cfg.Prices[cfg.ReasoningModel] = ModelPrice{InputPerM: rIn, CachedPerM: cfg.PriceCachedPerM, OutputPerM: rOut}
 	// Fail-fast on broken cascade wiring (§5/F-FUNC-9), at EVERY start (not just
 	// check-config): a layer that needs Gemini but has no key would silently never
 	// fire. Better to refuse to start than to quietly run degraded.
 	needsGemini := cfg.TrivialOffloadEnabled || cfg.RouterClassifierEnabled ||
 		(cfg.WebEnabled && cfg.WebProvider == webProviderGeminiGrounding)
 	if needsGemini && cfg.GeminiAPIKey == "" {
 		problems = append(problems, "GEMINI_API_KEY is required when TRIVIAL_OFFLOAD_ENABLED, ROUTER_CLASSIFIER_ENABLED, or WEB_ENABLED with gemini_grounding is set")
 	}
 	if cfg.RouterClassifierEnabled && !cfg.RouterEnabled {
 		problems = append(problems, "ROUTER_CLASSIFIER_ENABLED requires ROUTER_ENABLED")
 	}
 	if cfg.WebEnabled && cfg.WebProvider != webProviderGrokWebSearch && cfg.WebProvider != webProviderGeminiGrounding {
 		problems = append(problems, fmt.Sprintf("WEB_PROVIDER must be %q or %q, got %q",
 			webProviderGrokWebSearch, webProviderGeminiGrounding, cfg.WebProvider))
 	}
 	if cfg.ReasoningEnabled && cfg.ReasoningModel == "" {
 		problems = append(problems, "REASONING_MODEL is required when REASONING_ENABLED is set")
 	}
 	switch cfg.GrokReasoningEffort {
 	case "", "none", "low", "medium", "high":
 	default:
 		problems = append(problems, fmt.Sprintf(
 			"GROK_REASONING_EFFORT must be one of none/low/medium/high (or empty), got %q", cfg.GrokReasoningEffort))
 	}
 	switch cfg.ReasoningEffort {
 	case "none", "low", "medium", "high":
 	default:
 		problems = append(problems, fmt.Sprintf(
 			"REASONING_EFFORT must be one of none/low/medium/high, got %q", cfg.ReasoningEffort))
 	}
 	if len(problems) > 0 {
 		return nil, fmt.Errorf("invalid configuration:\n  - %s", strings.Join(problems, "\n  - "))
@ -223,6 +442,14 @@ func LoadConfig() (*Config, error) {
 	return cfg, nil
 }
 // needsGemini reports whether any enabled layer requires the Gemini backend — the
 // cheap trivial route, the Layer-1 classifier, or Gemini-native web grounding. Drives
 // both the fail-fast key check and whether the client is built at all.
 func (c *Config) needsGemini() bool {
 	return c.TrivialOffloadEnabled || c.RouterClassifierEnabled ||
 		(c.WebEnabled && c.WebProvider == webProviderGeminiGrounding)
 }
 // Summary returns a human-readable, SECRET-REDACTED dump for the startup log.
 func (c *Config) Summary() string {
 	servers := make([]string, 0, len(c.AllowedServers))
@ -255,6 +482,12 @@ func (c *Config) Summary() string {
 		"  HS_TOKEN           = " + redact(c.HSToken),
 		"  XAI_BASE_URL       = " + c.XAIBaseURL,
 		"  XAI_MODEL          = " + c.XAIModel,
 		"  GROK_REASONING_EFFORT = " + func() string {
 			if c.GrokReasoningEffort == "" {
 				return "(unset — not sent; provider default)"
 			}
 			return c.GrokReasoningEffort
 		}(),
 		"  XAI_API_KEY        = " + redact(c.XAIAPIKey),
 		fmt.Sprintf("  XAI_TEMPERATURE    = %g", c.XAITemp),
 		fmt.Sprintf("  MAX_OUTPUT_TOKENS  = %d", c.MaxOutTok),
@ -268,5 +501,14 @@ func (c *Config) Summary() string {
 		"  SYSTEM_PROMPT_PATH = " + c.SystemPromptPath,
 		"  STATE_DIR          = " + c.StateDir,
 		"  AI_BOT_DATABASE_URL= " + redact(c.DatabaseURL),
 		fmt.Sprintf("  REQUEST_BUDGET     = %s", c.RequestBudget),
 		fmt.Sprintf("  GROK_PROMPT_CACHE  = %t", c.GrokPromptCache),
 		fmt.Sprintf("  TELEMETRY_ENABLED  = %t (store_text=%t, retention=%s)",
 			c.TelemetryEnabled, c.TelemetryStoreText, c.TelemetryRetention),
 		fmt.Sprintf("  CASCADE: router=%t classifier=%t trivial=%t web=%t(%s, cap=%d) reason=%t(%s)",
 			c.RouterEnabled, c.RouterClassifierEnabled, c.TrivialOffloadEnabled,
 			c.WebEnabled, c.WebProvider, c.WebGroundingDailyCap, c.ReasoningEnabled, c.ReasoningEffort),
 		"  GEMINI_MODEL       = " + c.GeminiModel,
 		"  GEMINI_API_KEY     = " + redact(c.GeminiAPIKey),
 	}, "\n")
 }
--- a/apps/ai-bot/config_test.go
+++ b/apps/ai-bot/config_test.go
@ -0,0 +1,98 @@
 package main
 import (
 	"strings"
 	"testing"
 )
 // setBaseEnv sets the minimal valid environment (all cascade flags off) so each test
 // can toggle one combination and assert the fail-fast validation (F-FUNC-9).
 func setBaseEnv(t *testing.T) {
 	t.Helper()
 	t.Setenv("HOMESERVER_URL", "http://hs")
 	t.Setenv("BOT_MXID", "@ai:vojo.chat")
 	t.Setenv("AS_TOKEN", "as")
 	t.Setenv("HS_TOKEN", "hs")
 	t.Setenv("XAI_API_KEY", "xai")
 	t.Setenv("AI_BOT_DATABASE_URL", "postgres://x")
 	t.Setenv("ALLOWED_SERVERS", "vojo.chat")
 	// Force a clean baseline so the host environment can't leak in.
 	for _, k := range []string{
 		"GEMINI_API_KEY", "GEMINI_API_KEY_FILE", "ROUTER_ENABLED", "ROUTER_CLASSIFIER_ENABLED",
 		"TRIVIAL_OFFLOAD_ENABLED", "WEB_ENABLED", "REASONING_ENABLED", "WEB_PROVIDER", "REASONING_MODEL",
 	} {
 		t.Setenv(k, "")
 	}
 }
 func TestConfigBaseValid(t *testing.T) {
 	setBaseEnv(t)
 	if _, err := LoadConfig(); err != nil {
 		t.Fatalf("base config should be valid: %v", err)
 	}
 }
 func TestConfigAllCascadeFlagsDefaultOff(t *testing.T) {
 	setBaseEnv(t)
 	cfg, err := LoadConfig()
 	if err != nil {
 		t.Fatalf("%v", err)
 	}
 	if cfg.RouterEnabled || cfg.RouterClassifierEnabled || cfg.TrivialOffloadEnabled ||
 		cfg.WebEnabled || cfg.ReasoningEnabled || cfg.TelemetryEnabled || cfg.GrokPromptCache {
 		t.Fatal("every cascade/telemetry flag must default off (cascade-off == today)")
 	}
 	if cfg.WebProvider != webProviderGrokWebSearch {
 		t.Fatalf("default WEB_PROVIDER = %q, want grok_web_search", cfg.WebProvider)
 	}
 }
 func TestConfigTrivialNeedsGeminiKey(t *testing.T) {
 	setBaseEnv(t)
 	t.Setenv("TRIVIAL_OFFLOAD_ENABLED", "true")
 	if _, err := LoadConfig(); err == nil || !strings.Contains(err.Error(), "GEMINI_API_KEY") {
 		t.Fatalf("want GEMINI_API_KEY error, got %v", err)
 	}
 	t.Setenv("GEMINI_API_KEY", "gk")
 	if _, err := LoadConfig(); err != nil {
 		t.Fatalf("with key it should be valid: %v", err)
 	}
 }
 func TestConfigClassifierNeedsRouter(t *testing.T) {
 	setBaseEnv(t)
 	t.Setenv("GEMINI_API_KEY", "gk")
 	t.Setenv("ROUTER_CLASSIFIER_ENABLED", "true") // without ROUTER_ENABLED
 	if _, err := LoadConfig(); err == nil || !strings.Contains(err.Error(), "ROUTER_ENABLED") {
 		t.Fatalf("want ROUTER_ENABLED error, got %v", err)
 	}
 }
 func TestConfigBadWebProvider(t *testing.T) {
 	setBaseEnv(t)
 	t.Setenv("WEB_ENABLED", "true")
 	t.Setenv("WEB_PROVIDER", "bing")
 	if _, err := LoadConfig(); err == nil || !strings.Contains(err.Error(), "WEB_PROVIDER") {
 		t.Fatalf("want WEB_PROVIDER error, got %v", err)
 	}
 }
 // The default web provider (grok_web_search) uses the existing xAI key, so WEB_ENABLED
 // alone must NOT demand a Gemini key.
 func TestConfigWebGrokNeedsNoGeminiKey(t *testing.T) {
 	setBaseEnv(t)
 	t.Setenv("WEB_ENABLED", "true")
 	if _, err := LoadConfig(); err != nil {
 		t.Fatalf("web+grok_web_search should not need a Gemini key: %v", err)
 	}
 }
 // gemini_grounding DOES need a Gemini key.
 func TestConfigWebGeminiGroundingNeedsKey(t *testing.T) {
 	setBaseEnv(t)
 	t.Setenv("WEB_ENABLED", "true")
 	t.Setenv("WEB_PROVIDER", webProviderGeminiGrounding)
 	if _, err := LoadConfig(); err == nil || !strings.Contains(err.Error(), "GEMINI_API_KEY") {
 		t.Fatalf("want GEMINI_API_KEY error, got %v", err)
 	}
 }
--- a/apps/ai-bot/context.go
+++ b/apps/ai-bot/context.go
@ -9,20 +9,20 @@ type bufferedMsg struct {
 	isBot  bool
 }
-// buildContext assembles the xAI message list under the owner's minimisation
+// buildContext assembles the provider-neutral message list under the owner's
-// rule ("trigger + bot replies only", §6/F8):
+// minimisation rule ("trigger + bot replies only", §6/F8):
 //
 //   - GROUP rooms: send ONLY the bot's own prior replies (assistant turns) plus
 //     the single triggering message (user turn). Other participants' messages and
-//     display names never reach xAI — the third-party-consent mitigation.
+//     display names never reach the model — the third-party-consent mitigation.
 //   - 1:1 rooms: there are no third parties, so the peer's recent turns are
 //     included too for coherence. Still no display names (pseudo "user").
 //
 // `history` is the recent room window EXCLUDING the trigger; `triggerBody` is the
 // message that addressed the bot. Bodies are stripped of reply-fallback quotes so
 // quoted third-party text doesn't leak.
-func buildContext(system string, history []bufferedMsg, isDM bool, triggerBody string, maxEvents, maxTokens int) []xaiMessage {
+func buildContext(system string, history []bufferedMsg, isDM bool, triggerBody string, maxEvents, maxTokens int) []Message {
-	msgs := []xaiMessage{{Role: "system", Content: system}}
+	msgs := []Message{{Role: "system", Content: system}}
 	// Keep at most the last maxEvents history items.
 	if len(history) > maxEvents {
@ -34,16 +34,16 @@ func buildContext(system string, history []bufferedMsg, isDM bool, triggerBody s
 			continue
 		}
 		if h.isBot {
-			msgs = append(msgs, xaiMessage{Role: "assistant", Content: body})
+			msgs = append(msgs, Message{Role: "assistant", Content: body})
 			continue
 		}
 		if isDM {
-			msgs = append(msgs, xaiMessage{Role: "user", Content: body})
+			msgs = append(msgs, Message{Role: "user", Content: body})
 		}
 		// group + non-bot history → dropped (privacy minimisation)
 	}
-	msgs = append(msgs, xaiMessage{Role: "user", Content: stripReplyFallback(triggerBody)})
+	msgs = append(msgs, Message{Role: "user", Content: stripReplyFallback(triggerBody)})
 	return truncateToTokens(msgs, maxTokens)
 }
@ -57,7 +57,7 @@ func estimateTokens(s string) int {
 // truncateToTokens drops the oldest non-system, non-final messages until the
 // estimate fits maxTokens. The system prompt (index 0) and the final user
 // trigger are always preserved.
-func truncateToTokens(msgs []xaiMessage, maxTokens int) []xaiMessage {
+func truncateToTokens(msgs []Message, maxTokens int) []Message {
 	total := 0
 	for _, m := range msgs {
 		total += estimateTokens(m.Content)
--- a/apps/ai-bot/httpllm.go
+++ b/apps/ai-bot/httpllm.go
@ -0,0 +1,200 @@
 package main
 import (
 	"bytes"
 	"context"
 	"encoding/json"
 	"fmt"
 	"io"
 	"log/slog"
 	"math/rand"
 	"net/http"
 	"time"
 )
 // httpllm.go is the shared OpenAI-compatible Chat Completions transport: one
 // HTTP+retry implementation reused by every provider adapter. Grok and Gemini both
 // expose this wire format, so the retry/backoff classification (429/5xx/network =
 // retryable, other 4xx = terminal) lives once here, parameterised by base/key/
 // headers, instead of being copied per provider.
 // openAIClient performs OpenAI-compatible /chat/completions calls with retry.
 type openAIClient struct {
 	name    string // provider label for logs/errors ("xai", "gemini")
 	base    string
 	key     string
 	http    *http.Client
 	maxTry  int
 	headers map[string]string // extra static headers (provider-specific), may be nil
 	log     *slog.Logger
 }
 func newOpenAIClient(name, base, key string, headers map[string]string, logger *slog.Logger) *openAIClient {
 	return &openAIClient{
 		name:    name,
 		base:    base,
 		key:     key,
 		http:    &http.Client{},
 		maxTry:  3,
 		headers: headers,
 		log:     logger,
 	}
 }
 // --- OpenAI-compatible wire types -------------------------------------------------
 type openAIMessage struct {
 	Role    string `json:"role"`
 	Content string `json:"content"`
 }
 // openAITool is the wire shape of a model tool (e.g. web search). Only serialized
 // when the request carries tools, so a plain completion's body is unchanged.
 type openAITool struct {
 	Type string `json:"type"`
 }
 type openAIRequest struct {
 	Model       string          `json:"model"`
 	Messages    []openAIMessage `json:"messages"`
 	MaxTokens   int             `json:"max_tokens"`
 	Temperature float64         `json:"temperature"`
 	Stream      bool            `json:"stream"`
 	// Optional; omitempty keeps the grok_direct body byte-identical to before.
 	Tools           []openAITool `json:"tools,omitempty"`
 	ReasoningEffort string       `json:"reasoning_effort,omitempty"`
 	// SearchParameters drives xAI Live Search on chat/completions (the web route's
 	// grok_web_search provider). nil for every non-web call, so it serializes away.
 	SearchParameters any `json:"search_parameters,omitempty"`
 }
 type openAIUsage struct {
 	PromptTokens        int `json:"prompt_tokens"`
 	CompletionTokens    int `json:"completion_tokens"`
 	PromptTokensDetails struct {
 		CachedTokens int `json:"cached_tokens"`
 	} `json:"prompt_tokens_details"`
 }
 type openAIResponse struct {
 	ID      string `json:"id"`
 	Choices []struct {
 		Message struct {
 			Content string `json:"content"`
 		} `json:"message"`
 		FinishReason string `json:"finish_reason"`
 	} `json:"choices"`
 	Usage openAIUsage `json:"usage"`
 	// Citations is the source list xAI Live Search returns by default (absent on a
 	// non-web call → nil).
 	Citations []string `json:"citations"`
 }
 func (r *openAIResponse) Text() string {
 	if len(r.Choices) == 0 {
 		return ""
 	}
 	return r.Choices[0].Message.Content
 }
 // complete calls Chat Completions with retry on transient failures (429 / 5xx /
 // network timeout, exponential backoff + jitter). Non-retryable 4xx fail
 // immediately. On exhaustion the caller refunds the reserved request and notifies
 // the user, so a transient failure is never silently swallowed (F6). reqHeaders are
 // per-request headers (e.g. x-grok-conv-id) merged on top of the static ones; nil is
 // fine.
 func (c *openAIClient) complete(ctx context.Context, reqBody openAIRequest, reqHeaders map[string]string) (*openAIResponse, error) {
 	payload, err := json.Marshal(reqBody)
 	if err != nil {
 		return nil, err
 	}
 	var lastErr error
 	for attempt := 0; attempt < c.maxTry; attempt++ {
 		if attempt > 0 {
 			// 0.5s, 1s, 2s … capped at 8s, plus up to 250ms jitter.
 			backoff := time.Duration(500<<uint(attempt-1)) * time.Millisecond
 			if backoff > 8*time.Second {
 				backoff = 8 * time.Second
 			}
 			backoff += time.Duration(rand.Intn(250)) * time.Millisecond
 			select {
 			case <-ctx.Done():
 				return nil, ctx.Err()
 			case <-time.After(backoff):
 			}
 		}
 		resp, retryable, err := c.attempt(ctx, payload, reqHeaders)
 		if err == nil {
 			return resp, nil
 		}
 		lastErr = err
 		if ctx.Err() != nil {
 			return nil, ctx.Err()
 		}
 		if !retryable {
 			return nil, err
 		}
 		if c.log != nil {
 			c.log.Warn(c.name+" attempt failed, will retry", "attempt", attempt+1, "max", c.maxTry, "err", err)
 		}
 	}
 	return nil, fmt.Errorf("%s: exhausted %d attempts: %w", c.name, c.maxTry, lastErr)
 }
 // attempt performs one HTTP call. It returns retryable=true for 429/5xx and
 // network errors, false for other non-2xx (terminal 4xx). The per-attempt deadline
 // bounds a single hung connection; the overall per-request deadline (set by the
 // caller via ctx) bounds the whole retry loop so a cascade can't accrete minutes.
 func (c *openAIClient) attempt(ctx context.Context, payload []byte, reqHeaders map[string]string) (*openAIResponse, bool, error) {
 	attemptCtx, cancel := context.WithTimeout(ctx, 60*time.Second)
 	defer cancel()
 	req, err := http.NewRequestWithContext(attemptCtx, http.MethodPost, c.base+"/chat/completions", bytes.NewReader(payload))
 	if err != nil {
 		return nil, false, err
 	}
 	req.Header.Set("Content-Type", "application/json")
 	req.Header.Set("Authorization", "Bearer "+c.key)
 	for k, v := range c.headers {
 		req.Header.Set(k, v)
 	}
 	for k, v := range reqHeaders {
 		req.Header.Set(k, v)
 	}
 	resp, err := c.http.Do(req)
 	if err != nil {
 		// Network error / timeout — retryable (unless the parent ctx is done).
 		return nil, ctx.Err() == nil, err
 	}
 	defer resp.Body.Close()
 	data, _ := io.ReadAll(resp.Body)
 	if resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode >= 500 {
 		return nil, true, fmt.Errorf("%s http %d: %s", c.name, resp.StatusCode, snippet(data))
 	}
 	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
 		return nil, false, fmt.Errorf("%s http %d: %s", c.name, resp.StatusCode, snippet(data))
 	}
 	var out openAIResponse
 	if err := json.Unmarshal(data, &out); err != nil {
 		return nil, false, fmt.Errorf("%s decode: %w", c.name, err)
 	}
 	// A 2xx is a billed call even when the model returns empty content (content
 	// filter, finish_reason=length with no text, or no choices). Return it as a
 	// success so the caller books the real cost via the ledger instead of refunding
 	// the slot and losing the spend — which would let empty replies bypass BOTH the
 	// per-user cap and the global ceiling. The caller just won't send an empty body.
 	return &out, false, nil
 }
 func snippet(b []byte) string {
 	const max = 300
 	if len(b) > max {
 		return string(b[:max]) + "…"
 	}
 	return string(b)
 }
--- a/apps/ai-bot/llm.go
+++ b/apps/ai-bot/llm.go
@ -0,0 +1,65 @@
 package main
 import "context"
 // llm.go is the provider-neutral seam between the bot's business logic and the
 // concrete model backends. Nothing here names a vendor: the bot composes its
 // context, prices usage, and books spend against these types, and a thin adapter
 // (provider_xai.go, provider_gemini.go) maps them to/from each backend's wire
 // format. This is what lets a second model (Gemini) slot in behind a flag without
 // the business logic learning a new shape.
 // Message is one provider-neutral chat turn.
 type Message struct {
 	Role    string // "system" | "user" | "assistant"
 	Content string
 }
 // Usage is the provider-neutral token accounting returned with a completion. It
 // drives billing (computeUSD) — the counts are the API's own, authoritative even
 // if our price constants drift.
 type Usage struct {
 	PromptTokens     int
 	CachedTokens     int // subset of PromptTokens served from the provider's prompt cache
 	CompletionTokens int
 }
 // Tool is a provider-neutral tool the model may invoke (e.g. web search). Empty
 // today; the web-freshness layer (Phase 3) populates it. Carried here so the
 // request type is stable across phases.
 type Tool struct {
 	// Type names the tool, e.g. "web_search". Adapters translate it to each
 	// backend's tool wire shape.
 	Type string
 }
 // LLMRequest is a provider-neutral completion request. New optional fields (Tools,
 // ReasoningEffort) serialize away when empty, so a plain grok_direct call produces
 // exactly the same wire body it did before this seam existed.
 type LLMRequest struct {
 	Model           string
 	Messages        []Message
 	MaxTokens       int
 	Temperature     float64
 	Tools           []Tool // optional; populated by the web layer
 	ReasoningEffort string // optional; "" = default, e.g. "low"|"high" for the reasoning route
 	// ConvID is an optional prompt-cache routing hint. Adapters that support it (xAI's
 	// x-grok-conv-id) pin a conversation to one backend to raise cache hit rate; "" =
 	// don't send it. It is a header, not part of the request body, so it never changes
 	// the wire body and an unset value is a no-op.
 	ConvID string
 }
 // LLMResponse is a provider-neutral completion result.
 type LLMResponse struct {
 	Text              string
 	Usage             Usage
 	ProviderRequestID string // the backend's response id, logged for support/debug
 }
 // LLMClient is any chat-completion backend (Grok, Gemini, …). Implementations are
 // thin adapters over a wire protocol; the bot depends only on this interface, so
 // Bot.llm can be swapped or routed without touching business logic.
 type LLMClient interface {
 	Complete(ctx context.Context, req LLMRequest) (*LLMResponse, error)
 }
--- a/apps/ai-bot/main.go
+++ b/apps/ai-bot/main.go
@ -11,7 +11,6 @@ import (
 	"fmt"
 	"os"
 	"os/signal"
 	"path/filepath"
 	"syscall"
 )
@ -91,8 +90,3 @@ func main() {
 	}
 	logger.Info("shut down cleanly")
 }
 // statePath joins a filename under the configured state directory.
 func (c *Config) statePath(name string) string {
 	return filepath.Join(c.StateDir, name)
 }
--- a/apps/ai-bot/pricing.go
+++ b/apps/ai-bot/pricing.go
@ -0,0 +1,42 @@
 package main
 // pricing.go centralises model pricing as a per-model table (the LiteLLM pattern)
 // instead of three hardcoded Grok fields. The spend ledger prices each call by the
 // model it actually used, so when a second model (Gemini) starts answering some
 // routes, its cost books correctly against the same global ceiling.
 // ModelPrice is the per-1M-token USD price for one model, applied to the API's
 // returned usage so the wallet ceiling tracks real cost even as prices change.
 type ModelPrice struct {
 	InputPerM  float64 // non-cached prompt tokens
 	CachedPerM float64 // prompt tokens served from the provider cache (cheaper)
 	OutputPerM float64 // completion tokens
 }
 // CostBreakdown is the per-component USD cost of answering one request. A plain
 // grok_direct call has only Token; a cascade adds Router (the cheap classifier),
 // Grounding (Gemini Google-search) and/or WebTool (Grok web search) on top. Settle
 // books each column separately so the ledger and request_log can attribute spend,
 // and so a half-finished cascade can book only what it actually spent (§8.1).
 type CostBreakdown struct {
 	Token     float64
 	Grounding float64
 	WebTool   float64
 	Router    float64
 }
 // Total is the grand total across all components (the number the wallet ceiling and
 // request_log.total_usd care about). Computed, never stored, so it can't drift.
 func (c CostBreakdown) Total() float64 {
 	return c.Token + c.Grounding + c.WebTool + c.Router
 }
 // priceFor returns the configured price for a model. An unknown model falls back to
 // the default (final-voice) model's price rather than $0 — a $0 price would silently
 // blind the global ceiling to that call, the one failure mode we never want.
 func (c *Config) priceFor(model string) ModelPrice {
 	if p, ok := c.Prices[model]; ok {
 		return p
 	}
 	return c.Prices[c.XAIModel]
 }
--- a/apps/ai-bot/provider_gemini.go
+++ b/apps/ai-bot/provider_gemini.go
@ -0,0 +1,189 @@
 package main
 import (
 	"bytes"
 	"context"
 	"encoding/json"
 	"fmt"
 	"io"
 	"log/slog"
 	"net/http"
 	"net/url"
 	"strings"
 	"time"
 )
 // provider_gemini.go is the Gemini backend. Two faces:
 //
 //   - geminiClient: a thin LLMClient over the OpenAI-compatible endpoint, used for the
 //     cheap trivial route and the Layer-1 router classifier. Same wire format as Grok,
 //     so it reuses the shared transport (httpllm.go).
 //   - groundedSearch: a SEPARATE call against the NATIVE v1beta generateContent endpoint
 //     with the google_search tool. Grounding does NOT work on the OpenAI-compat layer
 //     (it is silently ignored there, and only on Gemini 3+) — verified against Google's
 //     docs (F-EXT-3) — so the web layer that wants Gemini grounding must use this native
 //     path and VERIFY citations came back, else degrade.
 type geminiClient struct {
 	http       *openAIClient
 	nativeBase string // …/v1beta — derived from the OpenAI-compat base by dropping /openai
 	key        string
 	model      string
 	httpc      *http.Client
 	log        *slog.Logger
 }
 // NewGeminiClient builds the Gemini backend. base is the OpenAI-compatible endpoint
 // (…/v1beta/openai); the native grounding endpoint is derived from it. Returns the
 // concrete type (not just LLMClient) because the web layer needs groundedSearch too.
 func NewGeminiClient(base, key, model string, logger *slog.Logger) *geminiClient {
 	return &geminiClient{
 		http:       newOpenAIClient("gemini", base, key, nil, logger),
 		nativeBase: strings.TrimSuffix(base, "/openai"),
 		key:        key,
 		model:      model,
 		httpc:      &http.Client{},
 		log:        logger,
 	}
 }
 // Complete answers via the OpenAI-compatible endpoint (trivial route + classifier).
 func (c *geminiClient) Complete(ctx context.Context, req LLMRequest) (*LLMResponse, error) {
 	msgs := make([]openAIMessage, len(req.Messages))
 	for i, m := range req.Messages {
 		msgs[i] = openAIMessage{Role: m.Role, Content: m.Content}
 	}
 	resp, err := c.http.complete(ctx, openAIRequest{
 		Model:       req.Model,
 		Messages:    msgs,
 		MaxTokens:   req.MaxTokens,
 		Temperature: req.Temperature,
 		Stream:      false,
 	}, nil)
 	if err != nil {
 		return nil, err
 	}
 	return &LLMResponse{
 		Text: resp.Text(),
 		Usage: Usage{
 			PromptTokens:     resp.Usage.PromptTokens,
 			CachedTokens:     resp.Usage.PromptTokensDetails.CachedTokens,
 			CompletionTokens: resp.Usage.CompletionTokens,
 		},
 		ProviderRequestID: resp.ID,
 	}, nil
 }
 // --- native v1beta grounded search (google_search tool) ---------------------------
 type geminiGroundResult struct {
 	Digest    string
 	Citations []string
 	Usage     Usage
 }
 // native generateContent wire types (only the fields we read/write).
 type geminiNativeRequest struct {
 	Contents []geminiContent `json:"contents"`
 	Tools    []geminiTool    `json:"tools"`
 }
 type geminiContent struct {
 	Role  string       `json:"role,omitempty"`
 	Parts []geminiPart `json:"parts"`
 }
 type geminiPart struct {
 	Text string `json:"text"`
 }
 type geminiTool struct {
 	// google_search is the current grounding tool (Gemini 3 / current models). The
 	// empty object enables it.
 	GoogleSearch struct{} `json:"google_search"`
 }
 type geminiNativeResponse struct {
 	Candidates []struct {
 		Content struct {
 			Parts []geminiPart `json:"parts"`
 		} `json:"content"`
 		GroundingMetadata struct {
 			GroundingChunks []struct {
 				Web struct {
 					URI   string `json:"uri"`
 					Title string `json:"title"`
 				} `json:"web"`
 			} `json:"groundingChunks"`
 		} `json:"groundingMetadata"`
 	} `json:"candidates"`
 	UsageMetadata struct {
 		PromptTokenCount        int `json:"promptTokenCount"`
 		CandidatesTokenCount    int `json:"candidatesTokenCount"`
 		CachedContentTokenCount int `json:"cachedContentTokenCount"`
 	} `json:"usageMetadata"`
 }
 // groundedSearch runs one grounded generateContent against the native endpoint and
 // returns the model's grounded answer plus the source URLs. It REQUIRES citations:
 // if groundingMetadata has no chunks the request was not actually grounded (the
 // silent-ignore failure mode, F-EXT-3), so it errors and the caller degrades rather
 // than passing off ungrounded — possibly stale — text as fresh.
 func (c *geminiClient) groundedSearch(ctx context.Context, query string) (geminiGroundResult, error) {
 	body, err := json.Marshal(geminiNativeRequest{
 		Contents: []geminiContent{{Role: "user", Parts: []geminiPart{{Text: query}}}},
 		Tools:    []geminiTool{{}},
 	})
 	if err != nil {
 		return geminiGroundResult{}, err
 	}
 	// API key in the query string is the native v1beta convention.
 	endpoint := fmt.Sprintf("%s/models/%s:generateContent?key=%s",
 		c.nativeBase, url.PathEscape(c.model), url.QueryEscape(c.key))
 	reqCtx, cancel := context.WithTimeout(ctx, 15*time.Second) // web/grounding budget (§8.2.2)
 	defer cancel()
 	req, err := http.NewRequestWithContext(reqCtx, http.MethodPost, endpoint, bytes.NewReader(body))
 	if err != nil {
 		return geminiGroundResult{}, err
 	}
 	req.Header.Set("Content-Type", "application/json")
 	resp, err := c.httpc.Do(req)
 	if err != nil {
 		return geminiGroundResult{}, err
 	}
 	defer resp.Body.Close()
 	data, _ := io.ReadAll(resp.Body)
 	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
 		return geminiGroundResult{}, fmt.Errorf("gemini grounding http %d: %s", resp.StatusCode, snippet(data))
 	}
 	var out geminiNativeResponse
 	if err := json.Unmarshal(data, &out); err != nil {
 		return geminiGroundResult{}, fmt.Errorf("gemini grounding decode: %w", err)
 	}
 	if len(out.Candidates) == 0 {
 		return geminiGroundResult{}, fmt.Errorf("gemini grounding: no candidates")
 	}
 	var sb strings.Builder
 	for _, p := range out.Candidates[0].Content.Parts {
 		sb.WriteString(p.Text)
 	}
 	var citations []string
 	for _, ch := range out.Candidates[0].GroundingMetadata.GroundingChunks {
 		if ch.Web.URI != "" {
 			citations = append(citations, ch.Web.URI)
 		}
 	}
 	// The verify-gate: no citations ⇒ not actually grounded ⇒ degrade.
 	if len(citations) == 0 {
 		return geminiGroundResult{}, fmt.Errorf("gemini grounding: no citations (ungrounded — degrade)")
 	}
 	return geminiGroundResult{
 		Digest:    strings.TrimSpace(sb.String()),
 		Citations: citations,
 		Usage: Usage{
 			PromptTokens:     out.UsageMetadata.PromptTokenCount,
 			CachedTokens:     out.UsageMetadata.CachedContentTokenCount,
 			CompletionTokens: out.UsageMetadata.CandidatesTokenCount,
 		},
 	}, nil
 }
--- a/apps/ai-bot/provider_xai.go
+++ b/apps/ai-bot/provider_xai.go
@ -0,0 +1,62 @@
 package main
 import (
 	"context"
 	"log/slog"
 )
 // provider_xai.go is the thin adapter for xAI's Grok backend. xAI speaks the
 // OpenAI Chat Completions wire format, so this is a shell over the shared
 // openAIClient transport (httpllm.go): it only maps the neutral LLMRequest/
 // LLMResponse to/from the wire types. Any xAI-specific request shaping would live
 // here, but Grok needs none today.
 type xaiClient struct {
 	http *openAIClient
 }
 // NewXAIClient builds the Grok backend. Returns the neutral LLMClient so the bot
 // holds no vendor type.
 func NewXAIClient(base, key string, logger *slog.Logger) LLMClient {
 	return &xaiClient{http: newOpenAIClient("xai", base, key, nil, logger)}
 }
 func (c *xaiClient) Complete(ctx context.Context, req LLMRequest) (*LLMResponse, error) {
 	msgs := make([]openAIMessage, len(req.Messages))
 	for i, m := range req.Messages {
 		msgs[i] = openAIMessage{Role: m.Role, Content: m.Content}
 	}
 	var tools []openAITool
 	for _, t := range req.Tools {
 		tools = append(tools, openAITool{Type: t.Type})
 	}
 	// x-grok-conv-id pins this conversation to one backend to raise the prompt-cache
 	// hit rate (caching itself is automatic on xAI). Only sent when set, so the
 	// default path adds no header.
 	var headers map[string]string
 	if req.ConvID != "" {
 		headers = map[string]string{"x-grok-conv-id": req.ConvID}
 	}
 	resp, err := c.http.complete(ctx, openAIRequest{
 		Model:           req.Model,
 		Messages:        msgs,
 		MaxTokens:       req.MaxTokens,
 		Temperature:     req.Temperature,
 		Stream:          false,
 		Tools:           tools,
 		ReasoningEffort: req.ReasoningEffort,
 	}, headers)
 	if err != nil {
 		return nil, err
 	}
 	return &LLMResponse{
 		Text: resp.Text(),
 		Usage: Usage{
 			PromptTokens:     resp.Usage.PromptTokens,
 			CachedTokens:     resp.Usage.PromptTokensDetails.CachedTokens,
 			CompletionTokens: resp.Usage.CompletionTokens,
 		},
 		ProviderRequestID: resp.ID,
 	}, nil
 }
--- a/apps/ai-bot/router.go
+++ b/apps/ai-bot/router.go
@ -0,0 +1,180 @@
 package main
 import (
 	"context"
 	"encoding/json"
 	"regexp"
 	"strings"
 )
 // router.go classifies a message into a route. It runs INSIDE respond() — after the
 // mention/media/foreign/single-flight gates (F-FUNC-7) — so a paid Layer-1 classifier
 // is never spent on a message today's bot drops for free.
 //
 // Two layers, both conservative (doubt → grok_direct, the safe floor that keeps
 // substantive questions on Grok, §8.6):
 //   - Layer-0: free regex heuristics (RU+EN). Always runs when ROUTER_ENABLED.
 //   - Layer-1: a cheap Gemini JSON classifier, consulted ONLY on Layer-0 grok_direct
 //     when ROUTER_CLASSIFIER_ENABLED. Any failure falls back to the Layer-0 verdict.
 // RouterDecision is the route plus the signals behind it (logged for threshold
 // calibration). Only Route/Source/Confidence/NeedsWeb drive behaviour today; the rest
 // are recorded for the offline router-replay eval (§9).
 type RouterDecision struct {
 	Route          string
 	Source         string // heuristic | classifier | default | forced | degraded
 	Confidence     float64
 	NeedsWeb       bool
 	Freshness      string
 	ReasoningLevel string
 	Domain         string
 	Difficulty     string
 }
 // Heuristic patterns. Kept deliberately tight: a false "trivial" leaks a real question
 // to the cheap model, so trivial fires only on short, unmistakable greetings/acks or
 // bare arithmetic. Freshness words route to web (a false web-route only costs a fetch
 // and degrades cleanly — never a wrong answer).
 var (
 	greetingRe   = regexp.MustCompile(`^(привет(ик)?|здравствуй(те)?|хай|прив|ку|добрый\s+(день|вечер|утро)|спасибо|спс|благодарю|пока|ок(ей)?|угу|ага|hello|hi|hey|yo|thanks|thank\s+you|thx|ty|bye|goodbye|ok|okay|cool|nice)[\s!.,)]*$`)
 	arithmeticRe = regexp.MustCompile(`^[\s(]*\d+(\s*[-+*/×÷]\s*\d+)+[\s)=?]*$`)
 	freshnessRe  = regexp.MustCompile(`(новост|сегодня|сейчас|последн|курс\s|погод|котировк|расписани|прогноз|breaking|today|right now|latest|current(ly)?|news|weather|stock price|exchange rate|score)`)
 )
 // routeLayer0 is the free heuristic. Confidence is a rough self-estimate used only for
 // logging/threshold tuning, not control flow.
 func routeLayer0(body string) RouterDecision {
 	s := strings.ToLower(strings.TrimSpace(body))
 	if s == "" {
 		return RouterDecision{Route: routeGrokDirect, Source: "heuristic", Confidence: 0.5}
 	}
 	if freshnessRe.MatchString(s) {
 		return RouterDecision{Route: routeWebThenGrok, Source: "heuristic", Confidence: 0.7, NeedsWeb: true, Freshness: "recent"}
 	}
 	if isTrivial(s) {
 		return RouterDecision{Route: routeTrivial, Source: "heuristic", Confidence: 0.85, Difficulty: "trivial"}
 	}
 	return RouterDecision{Route: routeGrokDirect, Source: "heuristic", Confidence: 0.6}
 }
 // isTrivial: a short greeting/ack or a bare arithmetic expression, with no sign of a
 // real question. Length-bounded so "thanks, now explain quantum tunnelling" is NOT
 // trivial.
 func isTrivial(s string) bool {
 	if arithmeticRe.MatchString(s) {
 		return true
 	}
 	if len(strings.Fields(s)) <= 4 && greetingRe.MatchString(s) {
 		return true
 	}
 	return false
 }
 // classify produces the final RouterDecision for a request. The manual reasoning
 // trigger is honoured independently of the heuristic router (it's a deliberate user
 // signal). Layer-1's cost, when it runs, is accumulated into cost.Router.
 func (b *Bot) classify(ctx context.Context, body string, cost *CostBreakdown) RouterDecision {
 	if b.cfg.ReasoningEnabled && containsTrigger(body, b.cfg.ReasoningTrigger) {
 		return RouterDecision{Route: routeReason, Source: "forced", Confidence: 1, ReasoningLevel: "high"}
 	}
 	if !b.cfg.RouterEnabled {
 		return RouterDecision{Route: routeGrokDirect, Source: "default"}
 	}
 	d := routeLayer0(body)
 	// Layer-1 only refines the uncertain grok_direct verdict, and only if enabled and
 	// the Gemini client exists. Anything else stands on the heuristic.
 	if d.Route != routeGrokDirect || !b.cfg.RouterClassifierEnabled || b.gemini == nil {
 		return d
 	}
 	refined, err := b.routeLayer1(ctx, body, cost)
 	if err != nil {
 		b.log.Warn("layer-1 classifier failed; using heuristic", "err", err)
 		return d // degrade to the heuristic verdict
 	}
 	return refined
 }
 // classifierConfidenceFloor is the bar a Layer-1 escalation OFF the safe floor
 // (trivial/web/reason) must clear. Below it, the verdict is treated as doubt and the
 // request stays on grok_direct — the owner's "substantive stays on Grok" rule (§8.6).
 // A low-confidence "trivial" is exactly the false-trivial voice leak we must not take.
 const classifierConfidenceFloor = 0.8
 // classifierPrompt asks Gemini for a strict JSON verdict. Kept terse to bound tokens.
 const classifierPrompt = `You are a router. Classify the user message into exactly one route and reply with ONLY a JSON object, no prose.
 Routes: "trivial" (greeting/ack/tiny arithmetic), "web" (needs fresh/current facts: news, prices, weather, "today"), "normal" (everything else).
 Schema: {"route":"trivial|web|normal","confidence":0.0-1.0,"needs_web":true|false}
 Message: `
 // routeLayer1 runs the Gemini classifier and parses its JSON. A non-JSON or unknown
 // answer is an error so classify() degrades to the heuristic — the cheap model never
 // gets to silently mis-route by returning garbage.
 func (b *Bot) routeLayer1(ctx context.Context, body string, cost *CostBreakdown) (RouterDecision, error) {
 	resp, err := b.gemini.Complete(ctx, LLMRequest{
 		Model:       b.cfg.GeminiModel,
 		Messages:    []Message{{Role: "user", Content: classifierPrompt + body}},
 		MaxTokens:   60,
 		Temperature: 0,
 	})
 	if err != nil {
 		return RouterDecision{}, err
 	}
 	cost.Router += computeUSD(b.cfg.GeminiModel, resp.Usage, b.cfg)
 	var parsed struct {
 		Route      string  `json:"route"`
 		Confidence float64 `json:"confidence"`
 		NeedsWeb   bool    `json:"needs_web"`
 	}
 	if err := json.Unmarshal([]byte(extractJSON(resp.Text)), &parsed); err != nil {
 		return RouterDecision{}, err
 	}
 	route := normalizeRoute(parsed.Route)
 	// Safe floor: a low-confidence escalation off grok_direct is doubt — keep it on
 	// Grok rather than leak a possibly-substantive question to the cheap model.
 	if route != routeGrokDirect && parsed.Confidence < classifierConfidenceFloor {
 		return RouterDecision{Route: routeGrokDirect, Source: "classifier", Confidence: parsed.Confidence}, nil
 	}
 	return RouterDecision{
 		Route:      route,
 		Source:     "classifier",
 		Confidence: parsed.Confidence,
 		NeedsWeb:   parsed.NeedsWeb || route == routeWebThenGrok,
 	}, nil
 }
 // normalizeRoute maps a classifier label to a route constant, defaulting unknown
 // labels to grok_direct — the safe floor, so a confused classifier never escalates.
 func normalizeRoute(label string) string {
 	switch strings.ToLower(strings.TrimSpace(label)) {
 	case "trivial", "trivial_direct":
 		return routeTrivial
 	case "web", "web_then_grok":
 		return routeWebThenGrok
 	case "reason", "reason_then_grok":
 		return routeReason
 	default:
 		return routeGrokDirect
 	}
 }
 // extractJSON pulls the first {...} object out of a model reply, tolerating prose or
 // code fences around it. Returns "" if none (→ a parse error → degrade).
 func extractJSON(s string) string {
 	i := strings.IndexByte(s, '{')
 	j := strings.LastIndexByte(s, '}')
 	if i < 0 || j < i {
 		return ""
 	}
 	return s[i : j+1]
 }
 // containsTrigger reports whether body contains the manual trigger phrase
 // (case-insensitive, whitespace-trimmed). Empty trigger never matches.
 func containsTrigger(body, trigger string) bool {
 	trigger = strings.TrimSpace(strings.ToLower(trigger))
 	if trigger == "" {
 		return false
 	}
 	return strings.Contains(strings.ToLower(body), trigger)
 }
--- a/apps/ai-bot/router_test.go
+++ b/apps/ai-bot/router_test.go
@ -0,0 +1,81 @@
 package main
 import "testing"
 // TestRouteLayer0 is the heuristic golden set (RU+EN). The critical property is the
 // safe floor: anything substantive must land on grok_direct, and a long message that
 // merely starts with a greeting must NOT be trivial (no leaking real questions to the
 // cheap model, §8.6).
 func TestRouteLayer0(t *testing.T) {
 	cases := []struct {
 		body string
 		want string
 	}{
 		// trivial: short greetings/acks and bare arithmetic
 		{"привет", routeTrivial},
 		{"Привет!", routeTrivial},
 		{"спасибо", routeTrivial},
 		{"спс", routeTrivial},
 		{"ок", routeTrivial},
 		{"hi", routeTrivial},
 		{"hello", routeTrivial},
 		{"thanks", routeTrivial},
 		{"thank you", routeTrivial},
 		{"ok", routeTrivial},
 		{"2+2", routeTrivial},
 		{"100 * 50", routeTrivial},
 		{"12 / 4 - 1", routeTrivial},
 		// web: freshness signals
 		{"какие новости сегодня?", routeWebThenGrok},
 		{"что сейчас происходит в мире", routeWebThenGrok},
 		{"курс доллара сегодня", routeWebThenGrok},
 		{"what's the weather today", routeWebThenGrok},
 		{"latest news on AI", routeWebThenGrok},
 		{"current bitcoin price", routeWebThenGrok},
 		// grok_direct: substantive (the safe floor)
 		{"посоветуй фильм на вечер", routeGrokDirect},
 		{"explain how TCP works", routeGrokDirect},
 		{"расскажи историю римской империи", routeGrokDirect},
 		{"спасибо, а теперь подробно объясни квантовую запутанность", routeGrokDirect}, // starts w/ ack but long
 		{"hi, can you help me debug this Go program?", routeGrokDirect},                // starts w/ hi but a real ask
 		{"напиши функцию сортировки на python", routeGrokDirect},
 	}
 	for _, c := range cases {
 		if got := routeLayer0(c.body).Route; got != c.want {
 			t.Errorf("routeLayer0(%q) = %q, want %q", c.body, got, c.want)
 		}
 	}
 }
 func TestNormalizeRoute(t *testing.T) {
 	cases := map[string]string{
 		"trivial": routeTrivial, "web": routeWebThenGrok, "reason": routeReason,
 		"normal": routeGrokDirect, "garbage": routeGrokDirect, "": routeGrokDirect,
 	}
 	for in, want := range cases {
 		if got := normalizeRoute(in); got != want {
 			t.Errorf("normalizeRoute(%q) = %q, want %q", in, got, want)
 		}
 	}
 }
 func TestExtractJSON(t *testing.T) {
 	if got := extractJSON("prefix {\"route\":\"web\"} suffix"); got != `{"route":"web"}` {
 		t.Errorf("extractJSON = %q", got)
 	}
 	if got := extractJSON("no json here"); got != "" {
 		t.Errorf("extractJSON(no json) = %q, want empty", got)
 	}
 }
 func TestContainsTrigger(t *testing.T) {
 	if !containsTrigger("ну подумай глубже про это", "подумай глубже") {
 		t.Error("should match trigger phrase mid-sentence")
 	}
 	if containsTrigger("just a normal question", "подумай глубже") {
 		t.Error("must not match when phrase absent")
 	}
 	if containsTrigger("anything", "") {
 		t.Error("empty trigger must never match")
 	}
 }
--- a/apps/ai-bot/store.go
+++ b/apps/ai-bot/store.go
@ -2,6 +2,7 @@ package main
 import (
 	"context"
 	"encoding/json"
 	"errors"
 	"fmt"
 	"time"
@ -107,6 +108,64 @@ var migrations = []string{
 		PRIMARY KEY (date, mxid)
 	);
 	CREATE TABLE IF NOT EXISTS warned_encrypted (room_id TEXT PRIMARY KEY);`,
 	// v2: component cost columns + the optimistic reservation column. `reserved_usd`
 	// holds the estimated max-cost of in-flight calls so the global ceiling counts
 	// committed + reserved spend at admission time (the TOCTOU fix, §8.1): without it
 	// a burst of concurrent calls all read the same low committed SUM and slip past
 	// the ceiling, because the USD only lands at settle, AFTER the call. The component
 	// columns let the ceiling see grounding/tool fees too (not just tokens), and feed
 	// the per-component analytics. ADD COLUMN IF NOT EXISTS is idempotent.
 	`ALTER TABLE spend ADD COLUMN IF NOT EXISTS reserved_usd  DOUBLE PRECISION NOT NULL DEFAULT 0;
 	 ALTER TABLE spend ADD COLUMN IF NOT EXISTS router_usd    DOUBLE PRECISION NOT NULL DEFAULT 0;
 	 ALTER TABLE spend ADD COLUMN IF NOT EXISTS grounding_usd DOUBLE PRECISION NOT NULL DEFAULT 0;
 	 ALTER TABLE spend ADD COLUMN IF NOT EXISTS webtool_usd   DOUBLE PRECISION NOT NULL DEFAULT 0;`,
 	// v3: request_log — one row per engaged request, for offline analysis of the route
 	// mix, per-component $/day, latency, escalation/degrade rates (§6.2). Operational,
 	// not message content: query_text is written ONLY when TELEMETRY_STORE_TEXT is on.
 	// Indexed by ts for the time-based retention trim and time-series queries.
 	`CREATE TABLE IF NOT EXISTS request_log (
 		id                  TEXT PRIMARY KEY,
 		ts                  TIMESTAMPTZ NOT NULL DEFAULT now(),
 		room_id             TEXT,
 		sender              TEXT,
 		route               TEXT,
 		router_source       TEXT,
 		router_confidence   REAL,
 		models              JSONB,
 		prompt_tokens       INT,
 		cached_tokens       INT,
 		completion_tokens   INT,
 		token_usd           DOUBLE PRECISION,
 		grounding_usd       DOUBLE PRECISION,
 		router_usd          DOUBLE PRECISION,
 		webtool_usd         DOUBLE PRECISION,
 		total_usd           DOUBLE PRECISION,
 		latency_ms          INT,
 		stage_ms            JSONB,
 		escalated           BOOL DEFAULT false,
 		fallback_fired      BOOL DEFAULT false,
 		cache_hit           BOOL DEFAULT false,
 		ceiling_hit         BOOL DEFAULT false,
 		per_user_cap_hit    BOOL DEFAULT false,
 		prompt_version      TEXT,
 		provider_request_id TEXT,
 		degraded            TEXT DEFAULT '',
 		err                 TEXT DEFAULT '',
 		ok                  BOOL,
 		query_text          TEXT
 	);
 	CREATE INDEX IF NOT EXISTS request_log_ts_idx ON request_log (ts);`,
 	// v4: per-day grounded-prompt counter for the web grounding cap guard (§8.2.3). One
 	// row per UTC day; the cap check + increment is one atomic statement (same TOCTOU
 	// discipline as the spend gate), so a burst can't blow past the $/1k grounding
 	// overage. Day-keyed, so it self-resets and needs no separate trim.
 	`CREATE TABLE IF NOT EXISTS grounding_count (
 		date TEXT PRIMARY KEY,
 		n    INTEGER NOT NULL DEFAULT 0
 	);`,
 }
 // migrate runs all pending migrations on a single connection under a session
@ -207,13 +266,19 @@ func (s *Store) SeenEvent(eventID string) (bool, error) {
 	return true, err
 }
-// SpentTodayUSD sums all spend for the current UTC day. SUM over no rows is NULL,
+// committedUSDExpr sums every COMMITTED cost component of a spend row — tokens plus
-// which scans into a nil *float64 → treated as 0.
+// the grounding/web/router fees a cascade can incur — so the wallet ceiling is never
 // blind to non-token spend. It deliberately excludes reserved_usd (that is in-flight,
 // not yet spent); the admission gate adds reserved separately.
 const committedUSDExpr = `usd + router_usd + grounding_usd + webtool_usd`
 // SpentTodayUSD sums all COMMITTED spend for the current UTC day. SUM over no rows is
 // NULL, which scans into a nil *float64 → treated as 0.
 func (s *Store) SpentTodayUSD() (float64, error) {
 	ctx, cancel := opContext()
 	defer cancel()
 	var v *float64
-	if err := s.pool.QueryRow(ctx, `SELECT SUM(usd) FROM spend WHERE date = $1`, todayUTC()).Scan(&v); err != nil {
+	if err := s.pool.QueryRow(ctx, `SELECT SUM(`+committedUSDExpr+`) FROM spend WHERE date = $1`, todayUTC()).Scan(&v); err != nil {
 		return 0, err
 	}
 	if v == nil {
@ -222,19 +287,30 @@ func (s *Store) SpentTodayUSD() (float64, error) {
 	return *v, nil
 }
-// Reserve runs the two independent gates in one transaction, BEFORE the xAI call
+// reserveDayLockKey namespaces the per-day admission lock so it can't collide with
-// (F4): the global USD ceiling protects the wallet; the per-user request cap is
+// the migration lock or any other advisory lock.
-// anti-abuse. It increments the per-user request count on success; the USD is
+const reserveDayLockKey = "ai-bot:reserve:"
-// reconciled after the response. Order: global first (cheapest to deny), then
+
-// per-user.
+// Reserve runs the two admission gates in one transaction, BEFORE the call (F4): the
 // global USD ceiling protects the wallet; the per-user request cap is anti-abuse. On
 // success it both increments the per-user request count AND books `estimate` (the
 // route's max-cost) into reserved_usd, so the global gate counts committed + reserved
 // spend. The actual USD is settled after the response (Settle), at which point the
 // reservation is released and the real cost booked. Order: global first (cheapest to
 // deny), then per-user.
 //
-// A transaction-scoped advisory lock on (date, mxid) serializes concurrent
+// The check-and-reserve is serialized GLOBALLY for the day by a transaction-scoped
-// reservations for the SAME user+day, so the per-user check-then-increment stays
+// advisory lock keyed on the date (not on date|mxid as the bare port did). This is
-// atomic. The former SQLite store got this for free (one connection serialized all
+// the TOCTOU fix (§8.1): the ceiling reads SUM(committed)+SUM(reserved) and then adds
-// callers); the pgx pool is concurrent, and the same user messaging from two rooms
+// its own reservation atomically, so a burst of DIFFERENT users can overshoot the
-// at once would otherwise be able to slip past the per-user cap. Different users
+// ceiling by at most ONE max-reservation rather than slipping through unbounded — the
-// never contend.
+// per-(date,mxid) lock only serialized one user with himself and left the cross-user
-func (s *Store) Reserve(mxid string, perUserCap int, dailyUSDCeiling float64) (reserveResult, error) {
+// ceiling unprotected. The former SQLite store serialized ALL callers on its single
 // connection anyway, so this restores that exact admission semantics, durably; the
 // bot is low-volume with per-room single-flight, so a per-day admission lock costs
 // nothing observable. Settle/Release run lock-free (they only release/convert spend,
 // never admit).
 func (s *Store) Reserve(mxid string, perUserCap int, perUserUSD, dailyUSDCeiling, estimate float64) (reserveResult, error) {
 	ctx, cancel := opContext()
 	defer cancel()
 	day := todayUTC()
@ -245,41 +321,50 @@ func (s *Store) Reserve(mxid string, perUserCap int, dailyUSDCeiling float64) (r
 	}
 	defer tx.Rollback(ctx)
-	// Key on date|mxid. The separator only needs to avoid cross-key ambiguity; a
+	if _, err := tx.Exec(ctx, `SELECT pg_advisory_xact_lock(hashtextextended($1, 0))`, reserveDayLockKey+day); err != nil {
 	// hash collision would merely over-serialize two unrelated users, never corrupt a
 	// count. (NUL is rejected by Postgres text, so use a printable separator.)
 	if _, err := tx.Exec(ctx, `SELECT pg_advisory_xact_lock(hashtextextended($1, 0))`, day+"|"+mxid); err != nil {
 		return reserveOK, err
 	}
-	// SUM over zero rows is NULL → nil pointer → treat as 0.0, exactly as the SQLite
+	// committed + reserved. SUM over zero rows is NULL → nil pointer → treat as 0.0,
-	// store's sql.NullFloat64 did (and as SpentTodayUSD does). This keeps the gate 1:1
+	// exactly as the SQLite store's sql.NullFloat64 did. This keeps the gate 1:1 even
-	// even at the degenerate dailyUSDCeiling == 0 (deny everything), where 0 >= 0.
+	// at the degenerate dailyUSDCeiling == 0 (deny everything), where 0 >= 0.
-	var global *float64
+	var inFlight *float64
-	if err := tx.QueryRow(ctx, `SELECT SUM(usd) FROM spend WHERE date = $1`, day).Scan(&global); err != nil {
+	if err := tx.QueryRow(ctx,
 		`SELECT SUM(`+committedUSDExpr+` + reserved_usd) FROM spend WHERE date = $1`, day).Scan(&inFlight); err != nil {
 		return reserveOK, err
 	}
 	spentToday := 0.0
-	if global != nil {
+	if inFlight != nil {
-		spentToday = *global
+		spentToday = *inFlight
 	}
 	if spentToday >= dailyUSDCeiling {
 		return reserveDeniedGlobal, nil
 	}
 	// Per-user row: read requests AND the user's own committed+reserved $ in one go, so
 	// both per-user gates are checked under the same lock. ErrNoRows → first request of
 	// the day for this user → all zero.
 	var requests int
-	err = tx.QueryRow(ctx, `SELECT requests FROM spend WHERE date = $1 AND mxid = $2`, day, mxid).Scan(&requests)
+	var userUSD float64
 	err = tx.QueryRow(ctx,
 		`SELECT requests, `+committedUSDExpr+` + reserved_usd FROM spend WHERE date = $1 AND mxid = $2`,
 		day, mxid).Scan(&requests, &userUSD)
 	if err != nil && !errors.Is(err, pgx.ErrNoRows) {
 		return reserveOK, err
 	}
 	if requests >= perUserCap {
 		return reserveDeniedUser, nil
 	}
 	// Optional per-user $ quota (0 = off): keep one user from draining the shared ceiling.
 	if perUserUSD > 0 && userUSD >= perUserUSD {
 		return reserveDeniedUser, nil
 	}
 	if _, err := tx.Exec(ctx,
-		`INSERT INTO spend (date, mxid, requests, usd) VALUES ($1, $2, 1, 0)
+		`INSERT INTO spend (date, mxid, requests, reserved_usd) VALUES ($1, $2, 1, $3)
-		 ON CONFLICT (date, mxid) DO UPDATE SET requests = spend.requests + 1`,
+		 ON CONFLICT (date, mxid) DO UPDATE SET requests = spend.requests + 1,
-		day, mxid); err != nil {
+		   reserved_usd = spend.reserved_usd + excluded.reserved_usd`,
 		day, mxid, estimate); err != nil {
 		return reserveOK, err
 	}
 	if err := tx.Commit(ctx); err != nil {
@ -288,10 +373,11 @@ func (s *Store) Reserve(mxid string, perUserCap int, dailyUSDCeiling float64) (r
 	return reserveOK, nil
 }
-// RefundRequest gives back a reserved request slot when the call ultimately
+// RefundRequest gives back a reserved request SLOT when the call ultimately failed
-// failed (e.g. an xAI outage), so a transient failure doesn't burn the user's
+// (an outage) or the reply couldn't be delivered (paid silence, §8.1), so a transient
-// daily cap. Never drops below zero. A single UPDATE is atomic, so concurrent
+// failure doesn't burn the user's daily cap. It does NOT touch USD: a 2xx is really
-// refunds settle correctly without extra locking.
+// billed even if we then fail to deliver. Never drops below zero. A single UPDATE is
 // atomic, so concurrent refunds settle correctly without extra locking.
 func (s *Store) RefundRequest(mxid string) error {
 	ctx, cancel := opContext()
 	defer cancel()
@ -301,19 +387,128 @@ func (s *Store) RefundRequest(mxid string) error {
 	return err
 }
-// Reconcile books the actual USD cost of a completed call against the user's
+// ReleaseReservation frees a reservation whose request produced no billable spend,
-// daily row (and thus the global total). The accumulating upsert is atomic and
+// restoring the global headroom without booking anything. The normal failure paths
-// commutative, so concurrent reconciles for the same user sum correctly.
+// settle via Settle (which also releases), so this is the safety valve for an
-func (s *Store) Reconcile(mxid string, usd float64) error {
+// UNSETTLED exit — a panic in generation, recovered by safego — where it runs with
 // RefundRequest in respond's deferred guard so a leaked reservation can't drift the
 // ceiling. GREATEST(0, …) guards against a double-release driving reserved_usd
 // negative. Lock-free: it only lowers the in-flight reserved total, never admits.
 func (s *Store) ReleaseReservation(mxid string, estimate float64) error {
 	ctx, cancel := opContext()
 	defer cancel()
 	_, err := s.pool.Exec(ctx,
-		`INSERT INTO spend (date, mxid, requests, usd) VALUES ($1, $2, 0, $3)
+		`UPDATE spend SET reserved_usd = GREATEST(0, reserved_usd - $3) WHERE date = $1 AND mxid = $2`,
-		 ON CONFLICT (date, mxid) DO UPDATE SET usd = spend.usd + excluded.usd`,
+		todayUTC(), mxid, estimate)
 		todayUTC(), mxid, usd)
 	return err
 }
 // Settle releases a call's reservation and books its ACTUAL cost in one atomic step
 // (replacing the old additive Reconcile): reserved_usd drops by the reservation while
 // the real per-component cost is added to the committed columns. This is non-additive
 // on the reservation (settle, not accumulate), the semantics the ceiling needs. It is
 // also the partial-cascade-refund primitive (§8.1): a web_then_grok call that paid
 // grounding but failed at the final model passes a CostBreakdown carrying only the
 // grounding it actually spent, releases the rest of the reservation, and refunds the
 // request slot separately. GREATEST(0, …) keeps reserved_usd from underflowing.
 // Atomic and commutative per row, so concurrent settles for one user sum correctly.
 func (s *Store) Settle(mxid string, estimate float64, cost CostBreakdown) error {
 	ctx, cancel := opContext()
 	defer cancel()
 	_, err := s.pool.Exec(ctx,
 		`INSERT INTO spend (date, mxid, requests, usd, router_usd, grounding_usd, webtool_usd, reserved_usd)
 		 VALUES ($1, $2, 0, $3, $4, $5, $6, 0)
 		 ON CONFLICT (date, mxid) DO UPDATE SET
 		   usd           = spend.usd           + excluded.usd,
 		   router_usd    = spend.router_usd    + excluded.router_usd,
 		   grounding_usd = spend.grounding_usd + excluded.grounding_usd,
 		   webtool_usd   = spend.webtool_usd   + excluded.webtool_usd,
 		   reserved_usd  = GREATEST(0, spend.reserved_usd - $7)`,
 		todayUTC(), mxid, cost.Token, cost.Router, cost.Grounding, cost.WebTool, estimate)
 	return err
 }
 // InsertRequestLog writes one analytics row. id is the event id (PRIMARY KEY), so a
 // re-logged event is a no-op (ON CONFLICT DO NOTHING) — each event takes exactly one
 // terminal path, so this never overwrites a real outcome. The write is isolated: the
 // caller runs it off the answer path and only logs a failure, never drops the reply.
 func (s *Store) InsertRequestLog(rl RequestLog) error {
 	ctx, cancel := opContext()
 	defer cancel()
 	models, err := json.Marshal(rl.Models)
 	if err != nil {
 		return err
 	}
 	stages, err := json.Marshal(rl.StageMS)
 	if err != nil {
 		return err
 	}
 	// query_text is NULL unless text capture is on (the struct carries "" otherwise),
 	// so the analytics table never holds message content by default.
 	var queryText any
 	if rl.QueryText != "" {
 		queryText = rl.QueryText
 	}
 	_, err = s.pool.Exec(ctx, `
 		INSERT INTO request_log (
 			id, room_id, sender, route, router_source, router_confidence, models,
 			prompt_tokens, cached_tokens, completion_tokens,
 			token_usd, grounding_usd, router_usd, webtool_usd, total_usd,
 			latency_ms, stage_ms, escalated, fallback_fired, cache_hit, ceiling_hit,
 			per_user_cap_hit, prompt_version, provider_request_id, degraded, err, ok, query_text
 		) VALUES (
 			$1, $2, $3, $4, $5, $6, $7,
 			$8, $9, $10,
 			$11, $12, $13, $14, $15,
 			$16, $17, $18, $19, $20, $21,
 			$22, $23, $24, $25, $26, $27, $28
 		) ON CONFLICT (id) DO NOTHING`,
 		rl.ID, rl.RoomID, rl.Sender, rl.Route, rl.RouterSource, rl.RouterConfidence, models,
 		rl.PromptTokens, rl.CachedTokens, rl.CompletionTokens,
 		rl.Cost.Token, rl.Cost.Grounding, rl.Cost.Router, rl.Cost.WebTool, rl.Cost.Total(),
 		rl.LatencyMS, stages, rl.Escalated, rl.FallbackFired, rl.CacheHit, rl.CeilingHit,
 		rl.PerUserCapHit, rl.PromptVersion, rl.ProviderRequestID, rl.Degraded, rl.Err, rl.OK, queryText)
 	return err
 }
 // TrimRequestLog deletes analytics rows older than the cutoff (time-based, since the
 // data is a time series — unlike the count-bounded dedup tables). A no-op for a zero
 // cutoff. Cheap given the ts index.
 func (s *Store) TrimRequestLog(olderThan time.Time) error {
 	ctx, cancel := opContext()
 	defer cancel()
 	_, err := s.pool.Exec(ctx, `DELETE FROM request_log WHERE ts < $1`, olderThan)
 	return err
 }
 // IncrGroundingIfUnder atomically admits one grounded prompt for today if the day's
 // count is below cap, returning whether it was admitted. The check-and-increment is a
 // single statement, so concurrent grounding calls can't race past the cap and into the
 // per-1k overage (§8.2.3). A non-positive cap denies everything (grounding effectively
 // off). The counter is day-keyed and self-resets at UTC midnight.
 func (s *Store) IncrGroundingIfUnder(cap int) (bool, error) {
 	if cap <= 0 {
 		return false, nil
 	}
 	ctx, cancel := opContext()
 	defer cancel()
 	var n int
 	err := s.pool.QueryRow(ctx, `
 		INSERT INTO grounding_count (date, n) VALUES ($1, 1)
 		ON CONFLICT (date) DO UPDATE SET n = grounding_count.n + 1
 		WHERE grounding_count.n < $2
 		RETURNING n`, todayUTC(), cap).Scan(&n)
 	if errors.Is(err, pgx.ErrNoRows) {
 		return false, nil // at/over cap — the conflict update was filtered out
 	}
 	if err != nil {
 		return false, err
 	}
 	return true, nil
 }
 // HasWarnedEncrypted / SetWarnedEncrypted persist the one-shot "reacted 🔒 to this
 // room because I can't read encryption" flag so a restart doesn't re-react on every
 // message (F5). The bot never reacts to its own events: m.reaction is not an
--- a/apps/ai-bot/store_test.go
+++ b/apps/ai-bot/store_test.go
@ -1,9 +1,11 @@
 package main
 import (
 	"fmt"
 	"sync"
 	"sync/atomic"
 	"testing"
 	"time"
 )
 // These tests exercise the Postgres-backed store directly. They run only when
@ -84,23 +86,23 @@ func TestStoreLimiterPerUserCap(t *testing.T) {
 	const cap, ceiling = 2, 100.0
 	for i := 0; i < cap; i++ {
-		if res, err := st.Reserve(user, cap, ceiling); err != nil || res != reserveOK {
+		if res, err := st.Reserve(user, cap, 0, ceiling, 0); err != nil || res != reserveOK {
 			t.Fatalf("reserve %d: got (%v,%v), want reserveOK", i, res, err)
 		}
 	}
 	// The (cap+1)th request is denied per-user.
-	if res, err := st.Reserve(user, cap, ceiling); err != nil || res != reserveDeniedUser {
+	if res, err := st.Reserve(user, cap, 0, ceiling, 0); err != nil || res != reserveDeniedUser {
 		t.Fatalf("over-cap reserve: got (%v,%v), want reserveDeniedUser", res, err)
 	}
 	// A different user is unaffected.
-	if res, err := st.Reserve("@v:vojo.chat", cap, ceiling); err != nil || res != reserveOK {
+	if res, err := st.Reserve("@v:vojo.chat", cap, 0, ceiling, 0); err != nil || res != reserveOK {
 		t.Fatalf("other user reserve: got (%v,%v), want reserveOK", res, err)
 	}
 	// Refund returns a slot, so the first user can reserve once more.
 	if err := st.RefundRequest(user); err != nil {
 		t.Fatalf("refund: %v", err)
 	}
-	if res, err := st.Reserve(user, cap, ceiling); err != nil || res != reserveOK {
+	if res, err := st.Reserve(user, cap, 0, ceiling, 0); err != nil || res != reserveOK {
 		t.Fatalf("post-refund reserve: got (%v,%v), want reserveOK", res, err)
 	}
 }
@ -110,7 +112,7 @@ func TestStoreLimiterPerUserCap(t *testing.T) {
 func TestStoreLimiterZeroCap(t *testing.T) {
 	st := openTestStore(t)
 	defer st.Close()
-	if res, err := st.Reserve("@u:vojo.chat", 0, 100.0); err != nil || res != reserveDeniedUser {
+	if res, err := st.Reserve("@u:vojo.chat", 0, 0, 100.0, 0); err != nil || res != reserveDeniedUser {
 		t.Fatalf("zero-cap reserve: got (%v,%v), want reserveDeniedUser", res, err)
 	}
 }
@ -121,7 +123,7 @@ func TestStoreLimiterZeroCap(t *testing.T) {
 func TestStoreLimiterZeroCeiling(t *testing.T) {
 	st := openTestStore(t)
 	defer st.Close()
-	if res, err := st.Reserve("@u:vojo.chat", 1_000_000, 0); err != nil || res != reserveDeniedGlobal {
+	if res, err := st.Reserve("@u:vojo.chat", 1_000_000, 0, 0, 0); err != nil || res != reserveDeniedGlobal {
 		t.Fatalf("zero-ceiling reserve on empty store: got (%v,%v), want reserveDeniedGlobal", res, err)
 	}
 }
@ -131,18 +133,18 @@ func TestStoreLimiterGlobalCeiling(t *testing.T) {
 	defer st.Close()
 	const ceiling = 1.0
-	// Book spend up to the ceiling (Reconcile is what feeds the global gate).
+	// Book spend up to the ceiling (Settle is what feeds the global gate).
-	if err := st.Reconcile("@a:vojo.chat", 0.6); err != nil {
+	if err := st.Settle("@a:vojo.chat", 0, CostBreakdown{Token: 0.6}); err != nil {
-		t.Fatalf("reconcile a: %v", err)
+		t.Fatalf("settle a: %v", err)
 	}
-	if err := st.Reconcile("@b:vojo.chat", 0.5); err != nil {
+	if err := st.Settle("@b:vojo.chat", 0, CostBreakdown{Token: 0.5}); err != nil {
-		t.Fatalf("reconcile b: %v", err)
+		t.Fatalf("settle b: %v", err)
 	}
 	if spent, err := st.SpentTodayUSD(); err != nil || spent < 1.1 {
 		t.Fatalf("spent today: got (%v,%v), want >= 1.1", spent, err)
 	}
 	// Now any reservation is denied globally, regardless of the per-user cap.
-	if res, err := st.Reserve("@c:vojo.chat", 1_000_000, ceiling); err != nil || res != reserveDeniedGlobal {
+	if res, err := st.Reserve("@c:vojo.chat", 1_000_000, 0, ceiling, 0); err != nil || res != reserveDeniedGlobal {
 		t.Fatalf("over-ceiling reserve: got (%v,%v), want reserveDeniedGlobal", res, err)
 	}
 }
@ -165,7 +167,7 @@ func TestStoreReserveConcurrentRespectsCap(t *testing.T) {
 		wg.Add(1)
 		go func() {
 			defer wg.Done()
-			res, err := st.Reserve(user, cap, 1e9)
+			res, err := st.Reserve(user, cap, 0, 1e9, 0)
 			if err != nil {
 				t.Errorf("reserve: %v", err)
 				return
@ -181,6 +183,266 @@ func TestStoreReserveConcurrentRespectsCap(t *testing.T) {
 	}
 }
 // TestStoreReserveConcurrentCeilingBounded is the §8.1 TOCTOU regression. Many
 // DIFFERENT users reserving at once against a low ceiling must not overshoot it by
 // more than ONE max-reservation. The bare pgx port's per-(date,mxid) lock left the
 // cross-user ceiling unprotected: every user read the same committed SUM(usd)=0 (the
 // USD only lands at settle, after the call) and slipped through, so all N were
 // admitted. The per-day admission lock + reserved_usd here bound the overshoot.
 // Run under -race.
 func TestStoreReserveConcurrentCeilingBounded(t *testing.T) {
 	st := openTestStore(t)
 	defer st.Close()
 	const estimate = 1.0         // each in-flight call reserves $1
 	const ceiling = 10.0         // so the gate should admit ~10, not 100
 	const perUserCap = 1_000_000 // keep the per-user cap out of the way
 	const goroutines = 100
 	var ok int64
 	var wg sync.WaitGroup
 	for i := 0; i < goroutines; i++ {
 		wg.Add(1)
 		go func(n int) {
 			defer wg.Done()
 			user := fmt.Sprintf("@u%d:vojo.chat", n) // a DIFFERENT user each time
 			res, err := st.Reserve(user, perUserCap, 0, ceiling, estimate)
 			if err != nil {
 				t.Errorf("reserve: %v", err)
 				return
 			}
 			if res == reserveOK {
 				atomic.AddInt64(&ok, 1)
 			}
 		}(i)
 	}
 	wg.Wait()
 	// committed+reserved < ceiling admits; the last admit can push reserved to just
 	// under ceiling+estimate, so admitted ≤ ceiling/estimate + 1. The pre-fix code
 	// admitted all 100.
 	maxAdmit := int64(ceiling/estimate) + 1
 	if ok < 1 || ok > maxAdmit {
 		t.Fatalf("admitted %d different users, want in [1, %d] (ceiling + one max-reserve)", ok, maxAdmit)
 	}
 	// Nothing was settled, so committed spend is still 0 — the cap came purely from
 	// reservations, which is the whole point (the USD isn't known until after the call).
 	if spent, err := st.SpentTodayUSD(); err != nil || spent != 0 {
 		t.Fatalf("committed spend = (%v,%v), want 0 (only reservations held)", spent, err)
 	}
 }
 // TestStoreSettleReleasesReservation verifies that Settle frees the reservation it
 // books actual cost for, restoring global headroom — proven through the admission
 // gate so it doesn't depend on reading the private column.
 func TestStoreSettleReleasesReservation(t *testing.T) {
 	st := openTestStore(t)
 	defer st.Close()
 	const est = 5.0
 	const ceiling = 10.0
 	// Two reservations fill the ceiling (reserved 5 + 5 = 10); the third is denied.
 	if res, _ := st.Reserve("@a:vojo.chat", 1_000_000, 0, ceiling, est); res != reserveOK {
 		t.Fatalf("reserve a: %v", res)
 	}
 	if res, _ := st.Reserve("@b:vojo.chat", 1_000_000, 0, ceiling, est); res != reserveOK {
 		t.Fatalf("reserve b: %v", res)
 	}
 	if res, _ := st.Reserve("@c:vojo.chat", 1_000_000, 0, ceiling, est); res != reserveDeniedGlobal {
 		t.Fatalf("reserve c over full ceiling: got %v, want denied", res)
 	}
 	// Settle a with a small actual cost: reserved 10→5, committed 0→0.01. Headroom
 	// returns, so a new reservation is admitted again.
 	if err := st.Settle("@a:vojo.chat", est, CostBreakdown{Token: 0.01}); err != nil {
 		t.Fatalf("settle a: %v", err)
 	}
 	if res, _ := st.Reserve("@d:vojo.chat", 1_000_000, 0, ceiling, est); res != reserveOK {
 		t.Fatalf("reserve d after settle freed headroom: got %v, want reserveOK", res)
 	}
 	if spent, _ := st.SpentTodayUSD(); spent < 0.009 || spent > 0.011 {
 		t.Fatalf("committed after one settle = %v, want ~0.01", spent)
 	}
 }
 // TestStoreReleaseReservation verifies the call-failed path: a released reservation
 // frees headroom and books no USD, and an over-release clamps reserved_usd to 0
 // rather than going negative (a negative reservation would manufacture phantom
 // headroom past the ceiling).
 func TestStoreReleaseReservation(t *testing.T) {
 	st := openTestStore(t)
 	defer st.Close()
 	const est = 5.0
 	const ceiling = 10.0
 	// Reserve a, then over-release it by far more than it held.
 	if res, _ := st.Reserve("@a:vojo.chat", 1_000_000, 0, ceiling, est); res != reserveOK {
 		t.Fatalf("reserve a: %v", res)
 	}
 	if err := st.ReleaseReservation("@a:vojo.chat", 100); err != nil {
 		t.Fatalf("over-release: %v", err)
 	}
 	// a's reserved must now be 0 (not -95): exactly two more $5 reservations fit the
 	// $10 ceiling, and the third is denied. Were reserved negative, far more would slip
 	// through — so the deny at the third request proves both the headroom was freed and
 	// the clamp held.
 	if res, _ := st.Reserve("@b:vojo.chat", 1_000_000, 0, ceiling, est); res != reserveOK {
 		t.Fatalf("reserve b: %v", res)
 	}
 	if res, _ := st.Reserve("@c:vojo.chat", 1_000_000, 0, ceiling, est); res != reserveOK {
 		t.Fatalf("reserve c: %v", res)
 	}
 	if res, _ := st.Reserve("@d:vojo.chat", 1_000_000, 0, ceiling, est); res != reserveDeniedGlobal {
 		t.Fatalf("reserve d: got %v, want denied (reserved must have clamped to 0, not gone negative)", res)
 	}
 	// Nothing was ever settled, so committed spend stays 0 — release books no USD.
 	if spent, _ := st.SpentTodayUSD(); spent != 0 {
 		t.Fatalf("committed after release = %v, want 0 (a failed call bills nothing)", spent)
 	}
 }
 // TestStoreRequestLog covers the analytics row: total_usd is the component sum,
 // query_text is NULL unless captured, re-inserting one id is a no-op, and the
 // time-based trim removes old rows.
 func TestStoreRequestLog(t *testing.T) {
 	st := openTestStore(t)
 	defer st.Close()
 	noText := RequestLog{
 		ID: "$ev-rl-1", RoomID: "!r:vojo.chat", Sender: "@u:vojo.chat",
 		Route: routeGrokDirect, RouterSource: "default",
 		Models:    map[string]string{"final": "grok-x"},
 		Cost:      CostBreakdown{Token: 0.01, Grounding: 0.02},
 		LatencyMS: 1234, StageMS: map[string]int{"final": 1200},
 		ProviderRequestID: "prov-1", OK: true, // QueryText empty → NULL
 	}
 	if err := st.InsertRequestLog(noText); err != nil {
 		t.Fatalf("insert: %v", err)
 	}
 	// Re-inserting the same id is a no-op (ON CONFLICT DO NOTHING), not an error.
 	if err := st.InsertRequestLog(noText); err != nil {
 		t.Fatalf("re-insert: %v", err)
 	}
 	withText := RequestLog{ID: "$ev-rl-2", Route: routeTrivial, OK: false, QueryText: "hello"}
 	if err := st.InsertRequestLog(withText); err != nil {
 		t.Fatalf("insert-with-text: %v", err)
 	}
 	ctx, cancel := opContext()
 	defer cancel()
 	var route string
 	var total float64
 	var ok bool
 	var qt *string
 	if err := st.pool.QueryRow(ctx,
 		`SELECT route, total_usd, ok, query_text FROM request_log WHERE id = $1`, noText.ID).
 		Scan(&route, &total, &ok, &qt); err != nil {
 		t.Fatalf("read row1: %v", err)
 	}
 	if route != routeGrokDirect || !ok {
 		t.Fatalf("row1 = (%q, ok=%v), want (grok_direct, true)", route, ok)
 	}
 	if d := total - 0.03; d > 1e-9 || d < -1e-9 {
 		t.Fatalf("row1 total_usd = %v, want 0.03 (token+grounding)", total)
 	}
 	if qt != nil {
 		t.Fatalf("row1 query_text = %q, want NULL when text capture off", *qt)
 	}
 	if err := st.pool.QueryRow(ctx, `SELECT query_text FROM request_log WHERE id = $1`, withText.ID).Scan(&qt); err != nil {
 		t.Fatalf("read row2: %v", err)
 	}
 	if qt == nil || *qt != "hello" {
 		t.Fatalf("row2 query_text = %v, want \"hello\"", qt)
 	}
 	// Trim everything older than one hour from now → both rows (ts<now) gone.
 	if err := st.TrimRequestLog(time.Now().Add(time.Hour)); err != nil {
 		t.Fatalf("trim: %v", err)
 	}
 	var count int
 	if err := st.pool.QueryRow(ctx, `SELECT count(*) FROM request_log`).Scan(&count); err != nil {
 		t.Fatalf("count: %v", err)
 	}
 	if count != 0 {
 		t.Fatalf("after trim count = %d, want 0", count)
 	}
 }
 // TestStorePerUserUSDCap covers the optional per-user $ quota: a user is denied once
 // their own committed+reserved spend reaches the cap, other users are unaffected, and a
 // zero cap disables the check.
 func TestStorePerUserUSDCap(t *testing.T) {
 	st := openTestStore(t)
 	defer st.Close()
 	const user = "@u:vojo.chat"
 	const perUserUSD = 1.0
 	if err := st.Settle(user, 0, CostBreakdown{Token: 0.9}); err != nil {
 		t.Fatalf("settle: %v", err)
 	}
 	// $0.9 < $1.0 cap → admitted.
 	if res, err := st.Reserve(user, 1_000_000, perUserUSD, 1e9, 0); err != nil || res != reserveOK {
 		t.Fatalf("under per-user USD: (%v,%v), want reserveOK", res, err)
 	}
 	// Push the user over the cap.
 	if err := st.Settle(user, 0, CostBreakdown{Token: 0.5}); err != nil { // now $1.4
 		t.Fatalf("settle: %v", err)
 	}
 	if res, err := st.Reserve(user, 1_000_000, perUserUSD, 1e9, 0); err != nil || res != reserveDeniedUser {
 		t.Fatalf("over per-user USD: (%v,%v), want reserveDeniedUser", res, err)
 	}
 	// A different user is unaffected by the first user's spend.
 	if res, _ := st.Reserve("@v:vojo.chat", 1_000_000, perUserUSD, 1e9, 0); res != reserveOK {
 		t.Fatal("other user must be unaffected by the first user's per-user USD")
 	}
 	// perUserUSD == 0 disables the check entirely (the big spender is admitted again).
 	if res, _ := st.Reserve(user, 1_000_000, 0, 1e9, 0); res != reserveOK {
 		t.Fatal("perUserUSD=0 must disable the per-user $ cap")
 	}
 }
 // TestStoreGroundingCap covers the durable grounding cap guard: it admits up to the
 // cap, then denies; a non-positive cap denies everything.
 func TestStoreGroundingCap(t *testing.T) {
 	st := openTestStore(t)
 	defer st.Close()
 	const cap = 3
 	for i := 0; i < cap; i++ {
 		if ok, err := st.IncrGroundingIfUnder(cap); err != nil || !ok {
 			t.Fatalf("grounding %d: (%v,%v), want admitted", i, ok, err)
 		}
 	}
 	if ok, err := st.IncrGroundingIfUnder(cap); err != nil || ok {
 		t.Fatalf("over-cap grounding: (%v,%v), want denied", ok, err)
 	}
 	if ok, _ := st.IncrGroundingIfUnder(0); ok {
 		t.Fatal("cap 0 must deny everything (grounding off)")
 	}
 }
 // TestStoreGroundingCapConcurrent: the atomic check-increment must admit EXACTLY cap
 // under a concurrent burst, so a spike can't blow past the $/1k overage. Run under -race.
 func TestStoreGroundingCapConcurrent(t *testing.T) {
 	st := openTestStore(t)
 	defer st.Close()
 	const cap = 10
 	const goroutines = 50
 	var ok int64
 	var wg sync.WaitGroup
 	for i := 0; i < goroutines; i++ {
 		wg.Add(1)
 		go func() {
 			defer wg.Done()
 			if a, err := st.IncrGroundingIfUnder(cap); err == nil && a {
 				atomic.AddInt64(&ok, 1)
 			}
 		}()
 	}
 	wg.Wait()
 	if ok != cap {
 		t.Fatalf("concurrent grounding admitted %d, want exactly %d", ok, cap)
 	}
 }
 func TestStoreWarnedEncrypted(t *testing.T) {
 	st := openTestStore(t)
 	const room = "!enc:vojo.chat"
--- a/apps/ai-bot/telemetry.go
+++ b/apps/ai-bot/telemetry.go
@ -0,0 +1,124 @@
 package main
 import "time"
 // telemetry.go is the request_log analytics path: it captures route, cost, latency
 // and outcome for each engaged request so the real $/day and route mix can be
 // MEASURED (the build plan's whole "is the cascade worth it" question) instead of
 // modelled. It is strictly off the answer path — gated by TELEMETRY_ENABLED, written
 // in a recovered goroutine, and a write failure only logs a WARN. A request never
 // fails to be answered because telemetry couldn't be recorded.
 // Route names (also the request_log.route values). grok_direct is today's path; the
 // rest land behind flags in later phases. "none" means no model ran (a skip or a
 // limiter denial).
 const (
 	routeNone        = "none"
 	routeGrokDirect  = "grok_direct"
 	routeTrivial     = "trivial_direct"
 	routeWebThenGrok = "web_then_grok"
 	routeReason      = "reason_then_grok"
 )
 // Degrade/skip reason strings (request_log.degraded). Stable tokens so the analytics
 // can GROUP BY them.
 const (
 	degradeEncrypted  = "encrypted_room"
 	degradeMedia      = "media"
 	degradeForeign    = "foreign_room"
 	degradeEmpty      = "empty_completion"
 	degradeSendFailed = "send_failed"
 	degradeReserveErr = "reserve_error"
 	degradeRouter     = "router_failed"
 	degradeWeb        = "web_failed"
 	degradeTrivial    = "trivial_failed"
 	degradeGroundCap  = "grounding_cap"
 	degradeReasoning  = "reasoning_failed"
 )
 // telemetryTrimEvery bounds how often the retention trim runs — once per N writes,
 // off the hot path, so the analytics table stays time-bounded without a separate
 // lifecycle or a DELETE on every insert.
 const telemetryTrimEvery = 200
 // RequestLog is one analytics row (the request_log columns). Zero values are the
 // "didn't apply" case — a grok_direct request leaves the cascade fields zero.
 type RequestLog struct {
 	ID               string
 	RoomID           string
 	Sender           string
 	Route            string
 	RouterSource     string // heuristic|classifier|default|forced|degraded
 	RouterConfidence float64
 	Models           map[string]string // {"router":"…","final":"…"}
 	PromptTokens     int
 	CachedTokens     int
 	CompletionTokens int
 	Cost             CostBreakdown
 	LatencyMS int
 	StageMS   map[string]int // {"router":12,"web":1400,"final":2100}
 	Escalated         bool
 	FallbackFired     bool
 	CacheHit          bool
 	CeilingHit        bool
 	PerUserCapHit     bool
 	PromptVersion     string
 	ProviderRequestID string
 	Degraded          string
 	Err               string
 	OK                bool
 	QueryText         string // stored only when TELEMETRY_STORE_TEXT; stripped otherwise
 }
 // recordTelemetry persists a row off the answer path. No-op unless TELEMETRY_ENABLED.
 // The query text is stripped unless TELEMETRY_STORE_TEXT, so message content never
 // lands in the analytics table by default. Runs in a recovered goroutine and only
 // logs failures, so it can never drop or delay the reply.
 func (b *Bot) recordTelemetry(rl RequestLog) {
 	if !b.cfg.TelemetryEnabled {
 		return
 	}
 	if !b.cfg.TelemetryStoreText {
 		rl.QueryText = ""
 	}
 	b.safego("telemetry", func() {
 		if err := b.st.InsertRequestLog(rl); err != nil {
 			b.log.Warn("request_log insert failed (non-fatal)", "id", rl.ID, "err", err)
 		}
 		b.maybeTrimTelemetry()
 	})
 }
 // recordSkip logs a request the bot was addressed by but couldn't fully serve before
 // any model ran (encrypted/media/foreign). These are low-frequency, so a direct row
 // (route=none + reason) keeps the "why no answer" visible without flooding the table
 // with the common not-addressed drops, which are not logged (pre-claim best-effort).
 func (b *Bot) recordSkip(ev *Event, reason string) {
 	b.recordTelemetry(RequestLog{
 		ID:            ev.EventID,
 		RoomID:        ev.RoomID,
 		Sender:        ev.Sender,
 		Route:         routeNone,
 		RouterSource:  "default",
 		PromptVersion: b.promptVersion,
 		Degraded:      reason,
 		OK:            false,
 	})
 }
 // maybeTrimTelemetry runs the time-based retention trim once per telemetryTrimEvery
 // writes. Best-effort and off the hot path (called from the telemetry goroutine).
 func (b *Bot) maybeTrimTelemetry() {
 	if b.cfg.TelemetryRetention <= 0 {
 		return
 	}
 	if b.telemetryWrites.Add(1)%telemetryTrimEvery != 0 {
 		return
 	}
 	if err := b.st.TrimRequestLog(time.Now().Add(-b.cfg.TelemetryRetention)); err != nil {
 		b.log.Warn("request_log trim failed (non-fatal)", "err", err)
 	}
 }
--- a/apps/ai-bot/telemetry_test.go
+++ b/apps/ai-bot/telemetry_test.go
@ -0,0 +1,71 @@
 package main
 import (
 	"io"
 	"log/slog"
 	"testing"
 	"time"
 )
 // newTestBot builds a Bot with just the fields the telemetry path needs — no network,
 // so it sidesteps NewBot's identity check.
 func newTestBot(st *Store, cfg *Config) *Bot {
 	return &Bot{cfg: cfg, st: st, log: slog.New(slog.NewTextHandler(io.Discard, nil)), promptVersion: "testv"}
 }
 func requestLogCount(t *testing.T, st *Store) int {
 	t.Helper()
 	ctx, cancel := opContext()
 	defer cancel()
 	var n int
 	if err := st.pool.QueryRow(ctx, `SELECT count(*) FROM request_log`).Scan(&n); err != nil {
 		t.Fatalf("count: %v", err)
 	}
 	return n
 }
 // TestRecordSkipWritesRow proves the early-return telemetry path actually records a
 // row (route=none + the skip reason) when TELEMETRY_ENABLED is on. The write is async,
 // so poll briefly.
 func TestRecordSkipWritesRow(t *testing.T) {
 	st := openTestStore(t)
 	defer st.Close()
 	b := newTestBot(st, &Config{TelemetryEnabled: true})
 	ev := &Event{EventID: "$skip-1", RoomID: "!r:vojo.chat", Sender: "@u:vojo.chat"}
 	b.recordSkip(ev, degradeMedia)
 	deadline := time.Now().Add(2 * time.Second)
 	for requestLogCount(t, st) == 0 && time.Now().Before(deadline) {
 		time.Sleep(20 * time.Millisecond)
 	}
 	if n := requestLogCount(t, st); n != 1 {
 		t.Fatalf("telemetry rows = %d, want 1", n)
 	}
 	ctx, cancel := opContext()
 	defer cancel()
 	var route, degraded string
 	if err := st.pool.QueryRow(ctx,
 		`SELECT route, degraded FROM request_log WHERE id = $1`, ev.EventID).Scan(&route, &degraded); err != nil {
 		t.Fatalf("read: %v", err)
 	}
 	if route != routeNone || degraded != degradeMedia {
 		t.Fatalf("row = (%q,%q), want (none, media)", route, degraded)
 	}
 }
 // TestTelemetryDisabledWritesNothing proves the default (TELEMETRY_ENABLED off) adds
 // no write path — strict "cascade-off == today".
 func TestTelemetryDisabledWritesNothing(t *testing.T) {
 	st := openTestStore(t)
 	defer st.Close()
 	b := newTestBot(st, &Config{TelemetryEnabled: false})
 	b.recordSkip(&Event{EventID: "$skip-2", RoomID: "!r:vojo.chat", Sender: "@u:vojo.chat"}, degradeMedia)
 	// Give any (incorrect) async write time to land, then assert nothing was written.
 	time.Sleep(200 * time.Millisecond)
 	if n := requestLogCount(t, st); n != 0 {
 		t.Fatalf("telemetry rows = %d, want 0 (TELEMETRY_ENABLED off)", n)
 	}
 }
--- a/apps/ai-bot/util.go
+++ b/apps/ai-bot/util.go
@ -1,6 +1,17 @@
 package main
-import "sync"
+import (
 	"hash/fnv"
 	"sync"
 )
 // hashString is a cheap, stable 32-bit hash (FNV-1a). Used for opaque, non-identifying
 // derived ids (e.g. the prompt-cache conv id) — not for security.
 func hashString(s string) uint32 {
 	h := fnv.New32a()
 	_, _ = h.Write([]byte(s))
 	return h.Sum32()
 }
 // lruSet is a bounded insertion-ordered string set used for event-id dedup and
 // tracking our own sent event ids. Oldest entries evict once cap is reached.
--- a/apps/ai-bot/web.go
+++ b/apps/ai-bot/web.go
@ -0,0 +1,223 @@
 package main
 import (
 	"bytes"
 	"context"
 	"encoding/json"
 	"errors"
 	"fmt"
 	"io"
 	"log/slog"
 	"net/http"
 )
 // web.go is the pluggable web-freshness layer (Phase 3). A WebProvider fetches a
 // grounded factual digest + source URLs for a query; the cascade then has Grok
 // synthesise the final answer in voice from that digest. Two providers, chosen by
 // WEB_PROVIDER:
 //
 //   - grok_web_search (DEFAULT): the xAI Agent Tools `web_search` tool on the Responses
 //     API (/v1/responses). NB the older chat/completions Live Search `search_parameters`
 //     mechanism was RETIRED by xAI (now 410 Gone), and the web_search tool is not on
 //     chat/completions — hence the Responses endpoint. Billed $5/1k tool calls + tokens.
 //   - gemini_grounding: Gemini native v1beta google_search. Cheaper, but Gemini-3 only
 //     and silently ungrounds otherwise (F-EXT-3) — so it runs behind a citations
 //     verify-gate and degrades if it fails.
 //
 // The web call is bounded by a per-stage timeout (and gemini_grounding additionally by a
 // durable daily cap), and either provider failing degrades the request to grok_direct
 // with a staleness hedge (never silence, never stale-as-fresh).
 //
 // The grok_web_search Responses-API request/response shape was VALIDATED live against
 // /v1/responses (2026-06-01): output[].type=="message" → content[].output_text + inline
 // url_citation annotations; usage carries input/output tokens, cached subset, and the
 // web_search_calls count (one request can search several times — each billed). The
 // computed cost matched the API's own cost_in_usd_ticks to 4 dp. A parse miss still
 // degrades safely (empty digest → grok_direct).
 const (
 	webProviderGrokWebSearch   = "grok_web_search"
 	webProviderGeminiGrounding = "gemini_grounding"
 	// grokWebSearchPerCall is xAI's Agent Tools fee: $5 per 1,000 web_search tool calls.
 	grokWebSearchPerCall = 5.0 / 1000.0
 	// maxWebSearchCalls bounds the per-call fee in the reservation envelope (one Responses
 	// request can search several times; the actual count is billed exactly at settle).
 	maxWebSearchCalls = 4
 )
 // errGroundingCapped signals the daily web/grounded-prompt cap was hit, so the caller
 // degrades (with a hedge) rather than paying past the cap.
 var errGroundingCapped = errors.New("web grounding daily cap reached")
 // WebContext is the result of a web fetch: a factual digest to feed the final model,
 // the sources behind it, the fetch's own token usage, and the cost the fetch incurred
 // (kept separate from the final synthesis tokens so each books to its own ledger
 // column). Cost is populated even when Digest is empty/failed, because the call was
 // still billed — the caller books it before degrading (§8.1 partial cascade).
 type WebContext struct {
 	Digest    string
 	Citations []string
 	Usage     Usage
 	Cost      CostBreakdown
 }
 // WebProvider fetches grounded facts for a query. Stateless. It returns its cost in the
 // WebContext even on error (the call was billed), and an error when the digest is
 // unusable so the caller can degrade.
 type WebProvider interface {
 	Fetch(ctx context.Context, query string) (WebContext, error)
 }
 // --- grok_web_search (default): xAI Agent Tools web_search on the Responses API -------
 type grokWebSearch struct {
 	base   string
 	key    string
 	model  string
 	cfg    *Config
 	httpc  *http.Client
 	logger *slog.Logger
 }
 func newGrokWebSearch(cfg *Config, logger *slog.Logger) *grokWebSearch {
 	return &grokWebSearch{
 		base: cfg.XAIBaseURL, key: cfg.XAIAPIKey, model: cfg.XAIModel,
 		cfg: cfg, httpc: &http.Client{}, logger: logger,
 	}
 }
 type grokResponsesRequest struct {
 	Model string       `json:"model"`
 	Input string       `json:"input"`
 	Tools []openAITool `json:"tools"`
 	// Keep the fetch fast/cheap when the operator runs a unified model with effort
 	// "none"; empty → not sent (provider default). Validated against /v1/responses.
 	ReasoningEffort string `json:"reasoning_effort,omitempty"`
 }
 // grokResponsesResponse maps the xAI Responses API shape (verified live 2026-06-01):
 // output[] carries reasoning/web_search_call/message items; the message item's content
 // has output_text (with inline url_citation annotations); usage reports tokens, the
 // cached subset, and the count of server-side web_search calls (a single request can
 // make several, each billed).
 type grokResponsesResponse struct {
 	Output []struct {
 		Type    string `json:"type"`
 		Content []struct {
 			Type        string `json:"type"`
 			Text        string `json:"text"`
 			Annotations []struct {
 				Type string `json:"type"`
 				URL  string `json:"url"`
 			} `json:"annotations"`
 		} `json:"content"`
 	} `json:"output"`
 	Usage struct {
 		InputTokens        int `json:"input_tokens"`
 		OutputTokens       int `json:"output_tokens"`
 		InputTokensDetails struct {
 			CachedTokens int `json:"cached_tokens"`
 		} `json:"input_tokens_details"`
 		ServerSideToolUsageDetails struct {
 			WebSearchCalls int `json:"web_search_calls"`
 		} `json:"server_side_tool_usage_details"`
 	} `json:"usage"`
 }
 func (p *grokWebSearch) Fetch(ctx context.Context, query string) (WebContext, error) {
 	body, err := json.Marshal(grokResponsesRequest{
 		Model: p.model, Input: query, Tools: []openAITool{{Type: "web_search"}},
 		ReasoningEffort: p.cfg.GrokReasoningEffort,
 	})
 	if err != nil {
 		return WebContext{}, err
 	}
 	req, err := http.NewRequestWithContext(ctx, http.MethodPost, p.base+"/responses", bytes.NewReader(body))
 	if err != nil {
 		return WebContext{}, err
 	}
 	req.Header.Set("Content-Type", "application/json")
 	req.Header.Set("Authorization", "Bearer "+p.key)
 	resp, err := p.httpc.Do(req)
 	if err != nil {
 		return WebContext{}, err
 	}
 	defer resp.Body.Close()
 	data, _ := io.ReadAll(resp.Body)
 	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
 		return WebContext{}, fmt.Errorf("grok web search http %d: %s", resp.StatusCode, snippet(data))
 	}
 	var out grokResponsesResponse
 	if err := json.Unmarshal(data, &out); err != nil {
 		return WebContext{}, fmt.Errorf("grok web search decode: %w", err)
 	}
 	var digest string
 	var citations []string
 	for _, item := range out.Output {
 		if item.Type != "message" {
 			continue
 		}
 		for _, c := range item.Content {
 			if c.Type == "output_text" {
 				digest += c.Text
 			}
 			for _, a := range c.Annotations {
 				if a.Type == "url_citation" && a.URL != "" {
 					citations = append(citations, a.URL)
 				}
 			}
 		}
 	}
 	usage := Usage{
 		PromptTokens:     out.Usage.InputTokens,
 		CachedTokens:     out.Usage.InputTokensDetails.CachedTokens,
 		CompletionTokens: out.Usage.OutputTokens,
 	}
 	// Cost = the call's tokens + the $5/1k fee times the ACTUAL number of web_search
 	// calls the request made (one request can search several times). Booked even when the
 	// digest is empty (the 2xx was billed), so the caller accounts for it before degrading.
 	// Cross-checked live against the API's own cost_in_usd_ticks — matched to 4 dp.
 	wc := WebContext{
 		Digest:    digest,
 		Citations: citations,
 		Usage:     usage,
 		Cost: CostBreakdown{
 			WebTool: computeUSD(p.model, usage, p.cfg) +
 				float64(out.Usage.ServerSideToolUsageDetails.WebSearchCalls)*grokWebSearchPerCall,
 		},
 	}
 	if digest == "" {
 		return wc, fmt.Errorf("grok web search: empty result")
 	}
 	return wc, nil
 }
 // --- gemini_grounding (Gemini-3 native only) --------------------------------------
 type geminiGrounding struct {
 	gem *geminiClient
 	st  *Store
 	cfg *Config
 }
 func (p *geminiGrounding) Fetch(ctx context.Context, query string) (WebContext, error) {
 	// Durable, atomic daily cap FIRST: a grounded prompt is billed whether or not it
 	// grounds, and the per-prompt overage ($35/1k on 2.5) is the cost this guard exists
 	// to bound. Admit against the cap before spending. (grok_web_search needs no such
 	// cap — its $5/1k per-call fee is fully reserved per request and bounded by the
 	// per-user request cap + global ceiling.)
 	if ok, err := p.st.IncrGroundingIfUnder(p.cfg.WebGroundingDailyCap); err != nil {
 		return WebContext{}, err
 	} else if !ok {
 		return WebContext{}, errGroundingCapped
 	}
 	res, err := p.gem.groundedSearch(ctx, query) // errors (incl. no-citations) → caller degrades
 	cost := CostBreakdown{Grounding: computeUSD(p.cfg.GeminiModel, res.Usage, p.cfg)}
 	if err != nil {
 		return WebContext{Cost: cost, Usage: res.Usage}, err
 	}
 	return WebContext{Digest: res.Digest, Citations: res.Citations, Usage: res.Usage, Cost: cost}, nil
 }
--- a/apps/ai-bot/xai.go
+++ b/apps/ai-bot/xai.go
@ -1,171 +0,0 @@
 package main
 import (
 	"bytes"
 	"context"
 	"encoding/json"
 	"fmt"
 	"io"
 	"log/slog"
 	"math/rand"
 	"net/http"
 	"time"
 )
 // XAIClient talks the OpenAI-compatible Chat Completions endpoint at
 // {base}/chat/completions with a Bearer key.
 type XAIClient struct {
 	base   string
 	key    string
 	http   *http.Client
 	maxTry int
 	log    *slog.Logger
 }
 func NewXAIClient(base, key string, logger *slog.Logger) *XAIClient {
 	return &XAIClient{
 		base:   base,
 		key:    key,
 		http:   &http.Client{},
 		maxTry: 3,
 		log:    logger,
 	}
 }
 type xaiMessage struct {
 	Role    string `json:"role"`
 	Content string `json:"content"`
 }
 type xaiRequest struct {
 	Model       string       `json:"model"`
 	Messages    []xaiMessage `json:"messages"`
 	MaxTokens   int          `json:"max_tokens"`
 	Temperature float64      `json:"temperature"`
 	Stream      bool         `json:"stream"`
 }
 type xaiUsage struct {
 	PromptTokens        int `json:"prompt_tokens"`
 	CompletionTokens    int `json:"completion_tokens"`
 	PromptTokensDetails struct {
 		CachedTokens int `json:"cached_tokens"`
 	} `json:"prompt_tokens_details"`
 }
 type xaiResponse struct {
 	Choices []struct {
 		Message struct {
 			Content string `json:"content"`
 		} `json:"message"`
 		FinishReason string `json:"finish_reason"`
 	} `json:"choices"`
 	Usage xaiUsage `json:"usage"`
 }
 func (r *xaiResponse) Text() string {
 	if len(r.Choices) == 0 {
 		return ""
 	}
 	return r.Choices[0].Message.Content
 }
 // Complete calls Chat Completions with retry on transient failures (429 / 5xx /
 // network timeout, exponential backoff + jitter). Non-retryable 4xx fail
 // immediately. On exhaustion the caller refunds the reserved request and notifies
 // the user, so a transient failure is never silently swallowed (F6).
 func (x *XAIClient) Complete(ctx context.Context, model string, msgs []xaiMessage, maxTokens int, temp float64) (*xaiResponse, error) {
 	reqBody := xaiRequest{
 		Model:       model,
 		Messages:    msgs,
 		MaxTokens:   maxTokens,
 		Temperature: temp,
 		Stream:      false,
 	}
 	payload, err := json.Marshal(reqBody)
 	if err != nil {
 		return nil, err
 	}
 	var lastErr error
 	for attempt := 0; attempt < x.maxTry; attempt++ {
 		if attempt > 0 {
 			// 0.5s, 1s, 2s … capped at 8s, plus up to 250ms jitter.
 			backoff := time.Duration(500<<uint(attempt-1)) * time.Millisecond
 			if backoff > 8*time.Second {
 				backoff = 8 * time.Second
 			}
 			backoff += time.Duration(rand.Intn(250)) * time.Millisecond
 			select {
 			case <-ctx.Done():
 				return nil, ctx.Err()
 			case <-time.After(backoff):
 			}
 		}
 		resp, retryable, err := x.attempt(ctx, payload)
 		if err == nil {
 			return resp, nil
 		}
 		lastErr = err
 		if ctx.Err() != nil {
 			return nil, ctx.Err()
 		}
 		if !retryable {
 			return nil, err
 		}
 		if x.log != nil {
 			x.log.Warn("xai attempt failed, will retry", "attempt", attempt+1, "max", x.maxTry, "err", err)
 		}
 	}
 	return nil, fmt.Errorf("xai: exhausted %d attempts: %w", x.maxTry, lastErr)
 }
 // attempt performs one HTTP call. It returns retryable=true for 429/5xx and
 // network errors, false for other non-2xx (terminal 4xx).
 func (x *XAIClient) attempt(ctx context.Context, payload []byte) (*xaiResponse, bool, error) {
 	// Per-attempt deadline so a hung connection doesn't block the whole loop.
 	attemptCtx, cancel := context.WithTimeout(ctx, 60*time.Second)
 	defer cancel()
 	req, err := http.NewRequestWithContext(attemptCtx, http.MethodPost, x.base+"/chat/completions", bytes.NewReader(payload))
 	if err != nil {
 		return nil, false, err
 	}
 	req.Header.Set("Content-Type", "application/json")
 	req.Header.Set("Authorization", "Bearer "+x.key)
 	resp, err := x.http.Do(req)
 	if err != nil {
 		// Network error / timeout — retryable (unless the parent ctx is done).
 		return nil, ctx.Err() == nil, err
 	}
 	defer resp.Body.Close()
 	data, _ := io.ReadAll(resp.Body)
 	if resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode >= 500 {
 		return nil, true, fmt.Errorf("xai http %d: %s", resp.StatusCode, snippet(data))
 	}
 	if resp.StatusCode < 200 || resp.StatusCode >= 300 {
 		return nil, false, fmt.Errorf("xai http %d: %s", resp.StatusCode, snippet(data))
 	}
 	var out xaiResponse
 	if err := json.Unmarshal(data, &out); err != nil {
 		return nil, false, fmt.Errorf("xai decode: %w", err)
 	}
 	// A 2xx is a billed call even when the model returns empty content (content
 	// filter, finish_reason=length with no text, or no choices). Return it as a
 	// success so the caller books the real cost via Reconcile instead of refunding
 	// the slot and losing the spend — which would let empty replies bypass BOTH the
 	// per-user cap and the global ceiling. The caller just won't send an empty body.
 	return &out, false, nil
 }
 func snippet(b []byte) string {
 	const max = 300
 	if len(b) > max {
 		return string(b[:max]) + "…"
 	}
 	return string(b)
 }