feat(ai-bot): add a flag-gated, provider-neutral answer cascade with a reserved-USD ledger and request telemetry

heaven 2026-06-01 18:06:15 +03:00
parent f7f6984d18
commit ff8918dae1
24 changed files with 2970 additions and 313 deletions


@@ -32,12 +32,20 @@ apps/ai-bot/
├── registration.go # generate + read registration.yaml (tokens, mautrix idiom) ├── registration.go # generate + read registration.yaml (tokens, mautrix idiom)
├── events.go # Matrix event types + decoders ├── events.go # Matrix event types + decoders
├── mentions.go # m.mentions + pill/reply fallbacks (F29/F30) ├── mentions.go # m.mentions + pill/reply fallbacks (F29/F30)
├── context.go # xAI message-window assembly (trigger + bot replies) ├── context.go # provider-neutral message-window assembly (trigger + bot replies)
├── xai.go # chat/completions client + retry (F6) ├── llm.go # provider-neutral types + LLMClient interface (no vendor names)
├── store.go # Postgres (vojo_ai): spend ledger, txn/event dedup, encrypted-warned set ├── httpllm.go # shared OpenAI-compatible chat/completions transport + retry (F6)
├── messages.go # bot-authored RU notices ├── provider_xai.go # thin xAI/Grok adapter over the shared transport
├── provider_gemini.go # Gemini adapter: OpenAI-compat client + native v1beta grounding
├── pricing.go # per-model price table (priceFor) + CostBreakdown
├── router.go # cascade router: Layer-0 heuristic + optional Layer-1 Gemini classifier
├── cascade.go # generate(): route dispatch with degrade-to-grok_direct
├── web.go # WebProvider: grok_web_search (Live Search) | gemini_grounding + cap guard
├── telemetry.go # request_log analytics row + async emit + retention trim
├── store.go # Postgres (vojo_ai): spend ledger (+reservation/components), dedup, request_log, grounding cap
├── messages.go # language-free emoji status reactions
├── markdown.go # markdown → org.matrix.custom.html for the reply's formatted_body ├── markdown.go # markdown → org.matrix.custom.html for the reply's formatted_body
├── util.go # bounded dedup set ├── util.go # bounded dedup set + small hash
├── prompts/system_ru.txt ├── prompts/system_ru.txt
├── Dockerfile # CGO-free static build → distroless, EXPOSE 8009 ├── Dockerfile # CGO-free static build → distroless, EXPOSE 8009
└── .env.example └── .env.example
@@ -49,8 +57,15 @@ All via environment (see `.env.example`). Required: `HOMESERVER_URL`, `BOT_MXID`
`AS_TOKEN`, `HS_TOKEN`, `XAI_API_KEY`, `ALLOWED_SERVERS`, `AI_BOT_DATABASE_URL`. `AS_TOKEN`, `HS_TOKEN`, `XAI_API_KEY`, `ALLOWED_SERVERS`, `AI_BOT_DATABASE_URL`.
`AS_ADDR` (default `:8009`) is the transaction-push listen address — it must match `AS_ADDR` (default `:8009`) is the transaction-push listen address — it must match
the `url` port in the registration. The model is env-configurable (`XAI_MODEL`, the `url` port in the registration. The model is env-configurable (`XAI_MODEL`,
default `grok-4.20-0309-non-reasoning`; `grok-4.3` is an alternative — **re-verify default `grok-4.20-0309-non-reasoning`).
the id + price on docs.x.ai before deploy**).
`grok-4.3` is the newer unified model (same price, 1M context): one model with a
`reasoning_effort` dial. If you switch `XAI_MODEL=grok-4.3`, set
`GROK_REASONING_EFFORT=none` to keep the default voice fast/cheap — otherwise the API
defaults to `low` and reasons on **every** reply. `GROK_REASONING_EFFORT` (accepted:
`none|low|medium|high`, default empty = not sent) is applied to the normal Grok voice
(grok_direct + web synthesis); leave it **empty** for `grok-4.20-non-reasoning`, which
rejects the param. The reason_then_grok route ignores this setting and sends `REASONING_EFFORT` (default `high`) instead.
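
The selection rule described above can be sketched as a small helper. This is only an illustration under stated assumptions: `effortFor` and the route constants are hypothetical names, and the sketch assumes the "think harder" route takes its value from `REASONING_EFFORT` while every other Grok call follows `GROK_REASONING_EFFORT`.

```go
package main

import "fmt"

// Route names used for illustration; the bot's real constants may differ.
const (
	routeGrokDirect = "grok_direct"
	routeReason     = "reason_then_grok"
)

// effortFor returns the reasoning_effort to send and whether to send it at all.
// grokEffort stands for GROK_REASONING_EFFORT ("" means: omit the parameter);
// reasoningEffort stands for REASONING_EFFORT (default "high").
func effortFor(route, grokEffort, reasoningEffort string) (effort string, send bool) {
	if route == routeReason {
		// The "think harder" route ignores the normal-voice setting.
		return reasoningEffort, true
	}
	if grokEffort == "" {
		// grok-4.20-non-reasoning rejects the param, so it must not be sent.
		return "", false
	}
	return grokEffort, true
}

func main() {
	cases := []struct{ route, grokEffort string }{
		{routeGrokDirect, ""},
		{routeGrokDirect, "none"},
		{routeReason, "none"},
	}
	for _, c := range cases {
		effort, send := effortFor(c.route, c.grokEffort, "high")
		fmt.Printf("route=%s configured=%q -> effort=%q send=%v\n", c.route, c.grokEffort, effort, send)
	}
}
```

Returning a separate `send` flag keeps "empty" distinct from "none": the first omits the field from the request body, the second sends an explicit value.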
### Database ### Database
@@ -78,8 +93,49 @@ AI_BOT_DATABASE_URL=postgres://vojo_ai:<secret>@postgres:5432/vojo_ai?sslmode=di
``` ```
The hard USD ceiling is priced from the **API-returned token usage** times the The hard USD ceiling is priced from the **API-returned token usage** times the
configured `XAI_PRICE_*_PER_M` fallbacks, so a price change only needs those per-model price table (`XAI_PRICE_*_PER_M`, `GEMINI_PRICE_*_PER_M`), so a price
constants updated — it can't silently blow the cap. change only needs those constants updated — it can't silently blow the cap. The
ceiling is enforced with an optimistic **reservation** (`reserved_usd`): a request's
estimated max-cost is booked at admission and settled to the real cost afterward, so
a burst of concurrent requests can't slip past `DAILY_USD_CEILING` (without the
reservation it could, since the real USD is only booked after each call returns).
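
The reserve-then-settle idea can be sketched with an in-memory ledger. This is a minimal sketch, not the bot's store: the real ledger lives in Postgres, and the `ledger` type with its `reserve` and `settle` methods is a hypothetical stand-in for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// ledger is a hypothetical in-memory stand-in for the daily spend row.
type ledger struct {
	mu       sync.Mutex
	spent    float64 // settled USD for the day
	reserved float64 // estimates held by in-flight requests
	ceiling  float64 // the daily USD ceiling
}

// reserve admits a request only if settled + reserved + estimate fits under the ceiling.
func (l *ledger) reserve(estimate float64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.spent+l.reserved+estimate > l.ceiling {
		return false
	}
	l.reserved += estimate
	return true
}

// settle releases the estimate and books the actual cost in one step.
func (l *ledger) settle(estimate, actual float64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.reserved -= estimate
	if l.reserved < 0 {
		l.reserved = 0
	}
	l.spent += actual
}

func main() {
	l := &ledger{ceiling: 1.00}
	results := make([]bool, 10) // each goroutine writes only its own slot
	var wg sync.WaitGroup
	for i := range results {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			results[i] = l.reserve(0.25)
		}(i)
	}
	wg.Wait()
	admitted := 0
	for _, ok := range results {
		if ok {
			admitted++
		}
	}
	fmt.Println("admitted:", admitted, "reserved USD:", l.reserved)
}
```

Because in-flight estimates count against the ceiling at admission time, a concurrent burst is bounded even though no real cost has been booked yet.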
### Operator accounting (Phase 1, on by default)
- `REQUEST_BUDGET_SECONDS` (default 180) — overall per-request deadline shared by all
model calls, so a slow/retried call (or a cascade) can't accrete minutes.
- `GROK_PROMPT_CACHE` (default false) — Grok caches prompt prefixes automatically; this
toggle only adds the `x-grok-conv-id` routing header (a per-room id) to raise the
cache hit rate. There is no `prompt_cache` body param (verified on docs.x.ai).
- `TELEMETRY_ENABLED` (default false) — write a `request_log` analytics row per engaged
request (route, per-component $, latency, degrade/ceiling reasons). The write is async
and isolated — its failure never drops a reply. `TELEMETRY_STORE_TEXT` (default false)
additionally keeps the query text (for offline eval); `TELEMETRY_RETENTION_DAYS`
(default 30) time-trims old rows. Turn telemetry on to measure the baseline before enabling
any cascade layer.
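
The shared deadline behind `REQUEST_BUDGET_SECONDS` can be illustrated with one `context.WithTimeout` that every stage inherits. This is a sketch under assumptions: `stage`, `runCascade` and `demo` are hypothetical names, and the sleeps stand in for model calls.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// stage simulates one model call that takes d, or aborts when ctx expires.
func stage(ctx context.Context, d time.Duration) error {
	select {
	case <-time.After(d):
		return nil
	case <-ctx.Done():
		return ctx.Err()
	}
}

// runCascade runs the stages under ONE shared deadline and reports how many
// stages completed before the first error.
func runCascade(parent context.Context, budget time.Duration, stages ...time.Duration) (int, error) {
	ctx, cancel := context.WithTimeout(parent, budget)
	defer cancel()
	for i, d := range stages {
		if err := stage(ctx, d); err != nil {
			return i, err
		}
	}
	return len(stages), nil
}

// demo runs two fast stages and one slow stage under a 100ms budget.
func demo() (completed int, timedOut bool) {
	n, err := runCascade(context.Background(), 100*time.Millisecond,
		10*time.Millisecond, 10*time.Millisecond, 500*time.Millisecond)
	return n, errors.Is(err, context.DeadlineExceeded)
}

func main() {
	n, timedOut := demo()
	fmt.Println("stages completed:", n, "budget exceeded:", timedOut)
}
```

With a per-stage timeout instead, three stages could each use their full allowance; a single parent deadline caps the total wait.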
### Cascade (Phase 2-4) — behind flags, **default OFF** (every layer off == today's bot)
All optional; an unset env is exactly today's single grok_direct call. Any layer off or
failing **degrades to grok_direct** (never silence). Do **not** enable in prod until the
offline-eval gate passes (misroute < 2-3% AND measured saving > the second provider's cost; see
`docs/plans/ai_backend_build_plan.md` §9).
| Env | Default | Meaning |
|---|---|---|
| `ROUTER_ENABLED` | false | Layer-0 heuristic router (else everything → grok_direct) |
| `ROUTER_CLASSIFIER_ENABLED` | false | Layer-1 Gemini classifier on uncertain cases (requires `ROUTER_ENABLED` + Gemini key) |
| `TRIVIAL_OFFLOAD_ENABLED` | false | answer trivial messages with Gemini (requires Gemini key) |
| `WEB_ENABLED` | false | web_then_grok route (Gemini/Grok fetches fresh facts, **Grok stays the voice**) |
| `WEB_PROVIDER` | `grok_web_search` | `grok_web_search` (xAI Agent Tools `web_search` on the Responses API, $5/1k calls, no Gemini key) or `gemini_grounding` (**cheapest**: Gemini does the fetch via native v1beta `google_search` and Grok voices it, ~$0.0013/query, validated on `gemini-2.5-flash-lite`; the F-EXT-3 "Gemini-3 only" caveat applies to the OpenAI-compat endpoint, while native v1beta works on 2.5). `gemini_grounding` requires `GEMINI_API_KEY`. |
| `WEB_GROUNDING_DAILY_CAP` | 450 | durable per-day cap for `gemini_grounding` before degrading (keep < the 500/day free grounding RPD; guards the per-1k overage) |
| `REASONING_ENABLED` | false | manual "think harder" route on `REASONING_TRIGGER` |
| `REASONING_TRIGGER` | `подумай глубже` | trigger phrase |
| `REASONING_MODEL` | `grok-4.3` | a **reasoning-capable** model (the default `grok-4.20-non-reasoning` rejects `reasoning_effort`) |
| `REASONING_EFFORT` | `high` | the reasoning_effort the "think harder" route sends (`none`, `low`, `medium`, or `high`) |
| `GEMINI_API_KEY` / `_FILE` | — | required only when a Gemini-using layer is on (fail-fast at startup otherwise) |
| `GEMINI_MODEL` | `gemini-2.5-flash-lite` | cheap model for trivial/classifier |
| `GEMINI_BASE_URL` | `…/v1beta/openai` | OpenAI-compat endpoint (native grounding endpoint derived from it) |
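
The dependencies between these flags lend themselves to a fail-fast check at startup. The sketch below is a guess at how such a check could look, not the bot's actual config code: `cascadeConfig` and `validate` are hypothetical, and the body of `needsGemini` is an assumption based on the "requires Gemini key" notes in the table.

```go
package main

import (
	"errors"
	"fmt"
)

// cascadeConfig is a hypothetical subset of the bot config, for illustration only.
type cascadeConfig struct {
	RouterEnabled, RouterClassifierEnabled, TrivialOffloadEnabled, WebEnabled bool
	WebProvider                                                               string
	GeminiAPIKey                                                              string
}

// needsGemini reports whether any enabled layer talks to Gemini (assumed rule).
func (c cascadeConfig) needsGemini() bool {
	return c.RouterClassifierEnabled || c.TrivialOffloadEnabled ||
		(c.WebEnabled && c.WebProvider == "gemini_grounding")
}

// validate rejects inconsistent flag combinations before the bot starts serving.
// It assumes config loading already filled in the WEB_PROVIDER default.
func (c cascadeConfig) validate() error {
	if c.RouterClassifierEnabled && !c.RouterEnabled {
		return errors.New("ROUTER_CLASSIFIER_ENABLED requires ROUTER_ENABLED")
	}
	if c.WebEnabled && c.WebProvider != "grok_web_search" && c.WebProvider != "gemini_grounding" {
		return fmt.Errorf("unknown WEB_PROVIDER %q", c.WebProvider)
	}
	if c.needsGemini() && c.GeminiAPIKey == "" {
		return errors.New("GEMINI_API_KEY is required by an enabled layer")
	}
	return nil
}

func main() {
	bad := cascadeConfig{WebEnabled: true, WebProvider: "gemini_grounding"}
	fmt.Println("grounding without key:", bad.validate())
	fmt.Println("everything off:", cascadeConfig{}.validate())
}
```

Failing at startup rather than on the first routed request keeps a misconfigured layer from degrading every request silently.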
## One-time setup (appservice registration) ## One-time setup (appservice registration)


@@ -32,7 +32,7 @@ func openTestStore(t *testing.T) *Store {
} }
ctx, cancel := opContext() ctx, cancel := opContext()
defer cancel() defer cancel()
if _, err := st.pool.Exec(ctx, `TRUNCATE processed_txn, processed_event, spend, warned_encrypted`); err != nil { if _, err := st.pool.Exec(ctx, `TRUNCATE processed_txn, processed_event, spend, warned_encrypted, request_log, grounding_count`); err != nil {
st.Close() st.Close()
t.Fatalf("truncate test tables: %v", err) t.Fatalf("truncate test tables: %v", err)
} }


@@ -5,6 +5,7 @@ import (
"fmt" "fmt"
"log/slog" "log/slog"
"sync" "sync"
"sync/atomic"
"time" "time"
) )
@@ -30,9 +31,22 @@ type Bot struct {
cfg *Config cfg *Config
log *slog.Logger log *slog.Logger
mx *MatrixClient mx *MatrixClient
xai *XAIClient llm LLMClient
st *Store st *Store
// gemini is the cheap chat backend for the trivial route and the Layer-1 classifier
// (an LLMClient so tests can fake it); nil unless a layer that uses it is enabled.
// web is the web-freshness provider, built only when WEB_ENABLED. Both nil → the
// cascade can only ever produce grok_direct.
gemini LLMClient
web WebProvider
// promptVersion is a short stable hash of the system prompt, logged with each
// request so prompt changes are visible in the analytics (A/B + regressions).
promptVersion string
// telemetryWrites paces the retention trim (every telemetryTrimEvery writes).
telemetryWrites atomic.Uint64
// mu guards the in-memory maps/sets below. Each transaction is acked to Synapse // mu guards the in-memory maps/sets below. Each transaction is acked to Synapse
// immediately (appservice.go) and its events are processed in transaction order, // immediately (appservice.go) and its events are processed in transaction order,
// but the slow xAI generation runs in a per-room goroutine and the lazy probes run // but the slow xAI generation runs in a per-room goroutine and the lazy probes run
@@ -49,7 +63,7 @@ type Bot struct {
func NewBot(ctx context.Context, cfg *Config, logger *slog.Logger) (*Bot, error) { func NewBot(ctx context.Context, cfg *Config, logger *slog.Logger) (*Bot, error) {
mx := NewMatrixClient(cfg.HomeserverURL, cfg.ASToken, cfg.BotMXID) mx := NewMatrixClient(cfg.HomeserverURL, cfg.ASToken, cfg.BotMXID)
xai := NewXAIClient(cfg.XAIBaseURL, cfg.XAIAPIKey, logger) llm := NewXAIClient(cfg.XAIBaseURL, cfg.XAIAPIKey, logger)
st, err := OpenStore(cfg.DatabaseURL) st, err := OpenStore(cfg.DatabaseURL)
if err != nil { if err != nil {
@@ -57,16 +71,35 @@ func NewBot(ctx context.Context, cfg *Config, logger *slog.Logger) (*Bot, error)
} }
b := &Bot{ b := &Bot{
cfg: cfg, cfg: cfg,
log: logger, log: logger,
mx: mx, mx: mx,
xai: xai, llm: llm,
st: st, st: st,
seen: newLRUSet(5000), promptVersion: fmt.Sprintf("%08x", hashString(cfg.SystemPrompt)),
botSent: newLRUSet(5000), seen: newLRUSet(5000),
meta: make(map[string]*roomMeta), botSent: newLRUSet(5000),
buf: make(map[string][]bufferedMsg), meta: make(map[string]*roomMeta),
inflight: make(map[string]bool), buf: make(map[string][]bufferedMsg),
inflight: make(map[string]bool),
}
// Build the cascade backends only for enabled layers (config already fail-fast
// validated that the keys exist). With every cascade flag off these stay nil and
// generate() can only produce grok_direct — today's bot. The grounding web provider
// needs the concrete client (for the native generateContent call), so keep a typed
// handle alongside the LLMClient face.
var gc *geminiClient
if cfg.needsGemini() {
gc = NewGeminiClient(cfg.GeminiBaseURL, cfg.GeminiAPIKey, cfg.GeminiModel, logger)
b.gemini = gc
}
if cfg.WebEnabled {
if cfg.WebProvider == webProviderGeminiGrounding {
b.web = &geminiGrounding{gem: gc, st: st, cfg: cfg}
} else {
b.web = newGrokWebSearch(cfg, logger)
}
} }
// Confirm the as_token + user_id resolves to BOT_MXID before serving. // Confirm the as_token + user_id resolves to BOT_MXID before serving.
@@ -223,7 +256,11 @@ func (b *Bot) handleMessage(ctx context.Context, ev *Event) {
// bot can't read it. The probe runs without the lock. // bot can't read it. The probe runs without the lock.
if b.ensureEncryption(ctx, roomID) { if b.ensureEncryption(ctx, roomID) {
b.log.Debug("skip: encrypted room", "room", roomID) b.log.Debug("skip: encrypted room", "room", roomID)
b.reactEncryptedOnce(ctx, roomID, ev.EventID) // Log the skip only when we actually react (once per room), so an encrypted room
// the bot can't read doesn't flood request_log with one row per message.
if b.reactEncryptedOnce(ctx, roomID, ev.EventID) {
b.recordSkip(ev, degradeEncrypted)
}
return return
} }
@@ -255,6 +292,7 @@ func (b *Bot) handleMessage(ctx context.Context, ev *Event) {
// "leak" the bot into) a federated room with non-consenting third parties. // "leak" the bot into) a federated room with non-consenting third parties.
if foreign { if foreign {
b.leaveForeign(ctx, roomID) b.leaveForeign(ctx, roomID)
b.recordSkip(ev, degradeForeign)
return return
} }
@@ -282,6 +320,7 @@ func (b *Bot) handleMessage(ctx context.Context, ev *Event) {
if isMedia { if isMedia {
b.log.Debug("skip: non-text msgtype (reacted)", "room", roomID, "sender", ev.Sender, "msgtype", mc.MsgType) b.log.Debug("skip: non-text msgtype (reacted)", "room", roomID, "sender", ev.Sender, "msgtype", mc.MsgType)
b.react(ctx, roomID, ev.EventID, reactMedia) b.react(ctx, roomID, ev.EventID, reactMedia)
b.recordSkip(ev, degradeMedia)
return return
} }
@@ -312,20 +351,46 @@ func (b *Bot) handleMessage(ctx context.Context, ev *Event) {
const unlimitedCap = 1 << 30 const unlimitedCap = 1 << 30
func (b *Bot) respond(ctx context.Context, roomID string, isDM bool, ev *Event, mc *MessageContent, history []bufferedMsg) { func (b *Bot) respond(ctx context.Context, roomID string, isDM bool, ev *Event, mc *MessageContent, history []bufferedMsg) {
started := time.Now()
// One telemetry row per request, populated as the flow decides its outcome and
// emitted once via defer — so every exit (deny, error, empty, paid silence, success)
// is recorded without scattering writes (F-FUNC-5). It starts as route=none/ok=false;
// proceeding to the model sets the route, success sets ok=true.
rl := RequestLog{
ID: ev.EventID, RoomID: roomID, Sender: ev.Sender,
Route: routeNone, RouterSource: "default",
PromptVersion: b.promptVersion,
QueryText: mc.Body,
Models: map[string]string{"final": b.cfg.XAIModel},
}
defer func() {
rl.LatencyMS = int(time.Since(started).Milliseconds())
b.recordTelemetry(rl)
}()
perUserCap := b.cfg.PerUserDailyCap perUserCap := b.cfg.PerUserDailyCap
perUserUSD := b.cfg.PerUserDailyUSD
if b.cfg.UnlimitedUsers[ev.Sender] { if b.cfg.UnlimitedUsers[ev.Sender] {
perUserCap = unlimitedCap perUserCap = unlimitedCap
perUserUSD = 0 // exempt from both per-user gates; the global ceiling still applies
} }
switch res, err := b.st.Reserve(ev.Sender, perUserCap, b.cfg.DailyUSDCeiling); { // Reserve the route's estimated max-cost (not $0) so the global ceiling counts
// this in-flight call BEFORE it returns — the TOCTOU fix (§8.1). The envelope covers
// the most expensive ENABLED route, so whichever the router picks is admitted within
// the reservation; with the cascade off it is exactly grok_direct's estimate.
estimate := b.reserveEstimate()
switch res, err := b.st.Reserve(ev.Sender, perUserCap, perUserUSD, b.cfg.DailyUSDCeiling, estimate); {
case err != nil: case err != nil:
// A limiter failure is on our side — don't leave the user wondering. // A limiter failure is on our side — don't leave the user wondering.
b.log.Error("limiter reserve failed", "sender", ev.Sender, "err", err) b.log.Error("limiter reserve failed", "sender", ev.Sender, "err", err)
rl.Degraded, rl.Err = degradeReserveErr, err.Error()
b.react(ctx, roomID, ev.EventID, reactError) b.react(ctx, roomID, ev.EventID, reactError)
return return
case res == reserveDeniedUser: case res == reserveDeniedUser:
// Per-user cap (anti-abuse, F24): stop answering, but always signal the limit — // Per-user cap (anti-abuse, F24): stop answering, but always signal the limit —
// no message addressed to the bot is left without feedback. // no message addressed to the bot is left without feedback.
b.log.Info("per-user daily cap reached; reacting", "sender", ev.Sender) b.log.Info("per-user daily cap reached; reacting", "sender", ev.Sender)
rl.PerUserCapHit = true
b.react(ctx, roomID, ev.EventID, reactRateLimit) b.react(ctx, roomID, ev.EventID, reactRateLimit)
return return
case res == reserveDeniedGlobal: case res == reserveDeniedGlobal:
@@ -333,23 +398,80 @@ func (b *Bot) respond(ctx context.Context, roomID string, isDM bool, ev *Event,
// once-per-day text notice), so signal every affected message rather than // once-per-day text notice), so signal every affected message rather than
// going silent after the first. // going silent after the first.
b.log.Warn("global daily USD ceiling reached", "room", roomID, "sender", ev.Sender) b.log.Warn("global daily USD ceiling reached", "room", roomID, "sender", ev.Sender)
rl.CeilingHit = true
b.react(ctx, roomID, ev.EventID, reactRateLimit) b.react(ctx, roomID, ev.EventID, reactRateLimit)
return return
} }
// Past admission, a reservation + request slot are held. Guarantee they're freed on
// ANY exit that didn't settle — including a panic in generate() (recovered by safego)
// — so a leaked reservation can never permanently drift the global ceiling down
// (§8.1c). The normal paths set settled=true at Settle, so this defer then no-ops; it
// fires only on the panic/unexpected-return path, where it also reacts so the failure
// isn't silent.
settled := false
defer func() {
if settled {
return
}
rl.Degraded, rl.Err = "panic", "generation panicked or returned without settling"
if rerr := b.st.ReleaseReservation(ev.Sender, estimate); rerr != nil {
b.log.Error("release reservation (unsettled) failed", "sender", ev.Sender, "err", rerr)
}
if rerr := b.st.RefundRequest(ev.Sender); rerr != nil {
b.log.Error("refund (unsettled) failed", "sender", ev.Sender, "err", rerr)
}
b.react(ctx, roomID, ev.EventID, reactError)
}()
// Show "Vojo AI печатает…" for the whole generation. The keepalive refreshes the // Show "Vojo AI печатает…" for the whole generation. The keepalive refreshes the
// typing notification every 20s (the server expires it after 30s) so the indicator // typing notification every 20s (the server expires it after 30s) so the indicator
// never lapses on a slow/retried answer, and the deferred stop clears it on exit. // never lapses on a slow/retried answer, and the deferred stop clears it on exit.
stopTyping := b.startTypingKeepalive(ctx, roomID) stopTyping := b.startTypingKeepalive(ctx, roomID)
defer stopTyping() defer stopTyping()
msgs := buildContext(b.cfg.SystemPrompt, history, isDM, mc.Body, b.cfg.MaxCtxEvent, 8000) // Overall per-request deadline (§8.2.2): every model call in the cascade shares this
resp, err := b.xai.Complete(ctx, b.cfg.XAIModel, msgs, b.cfg.MaxOutTok, b.cfg.XAITemp) // single budget (genCtx), so a multi-stage route can't accrete minutes the way
// per-stage 3×60s retries would. react/send/store ops use the live room ctx, NOT
// genCtx, so a budget timeout still surfaces as a ⚠️ react, never silence.
genCtx, cancel := context.WithTimeout(ctx, b.cfg.RequestBudget)
defer cancel()
msgs := buildContext(b.cfg.SystemPrompt, history, isDM, mc.Body, b.cfg.MaxCtxEvent, maxPromptTokens)
res, err := b.generate(genCtx, mc.Body, msgs, b.convID(roomID))
// Record what the routing + generation actually did, whatever the outcome.
rl.Route = res.route
rl.RouterSource = res.decision.Source
rl.RouterConfidence = res.decision.Confidence
rl.FallbackFired = res.fallback
rl.Escalated = res.route == routeReason
rl.Cost = res.cost
if res.stageMS != nil {
rl.StageMS = res.stageMS
}
if res.finalModel != "" {
rl.Models["final"] = res.finalModel
}
if res.decision.Source == "classifier" {
rl.Models["router"] = b.cfg.GeminiModel
}
if res.degraded != "" {
rl.Degraded = res.degraded
}
if err != nil { if err != nil {
// at-most-once already retried transient failures inside Complete; refund the // Terminal: even grok_direct failed. Settle whatever the cascade ACTUALLY spent
// reserved request so an xAI outage doesn't burn the user's daily cap, and // (e.g. a paid web fetch before the failure) and release the rest of the
// signal the failure (react → no anti-loop, no language). // reservation in one step, then refund the request slot so an outage doesn't burn
b.log.Error("xai completion failed", "sender", ev.Sender, "err", err) // the cap, and react (never silent). Settle with an all-zero cost is just a
// release, so a pure grok_direct failure books nothing — exactly as before.
b.log.Error("generation failed", "sender", ev.Sender, "route", res.route, "err", err)
rl.Err = err.Error()
if serr := b.st.Settle(ev.Sender, estimate, res.cost); serr != nil {
b.log.Error("settle (failed request) failed", "sender", ev.Sender, "err", serr)
}
settled = true
if rerr := b.st.RefundRequest(ev.Sender); rerr != nil { if rerr := b.st.RefundRequest(ev.Sender); rerr != nil {
b.log.Error("refund failed", "sender", ev.Sender, "err", rerr) b.log.Error("refund failed", "sender", ev.Sender, "err", rerr)
} }
@@ -357,39 +479,86 @@ func (b *Bot) respond(ctx context.Context, roomID string, isDM bool, ev *Event,
return return
} }
// A 2xx from xAI is billed even if the text came back empty — always book the real // Success from some route. Settle: release the reservation and book the real
// cost so both caps see it (an empty 200 must not bypass the per-user cap and the // per-component cost, so both caps see grounding/tool fees too — not just tokens.
// global ceiling). if err := b.st.Settle(ev.Sender, estimate, res.cost); err != nil {
usd := computeUSD(resp.Usage, b.cfg) b.log.Error("settle spend failed", "sender", ev.Sender, "err", err)
if err := b.st.Reconcile(ev.Sender, usd); err != nil {
b.log.Error("reconcile spend failed", "sender", ev.Sender, "err", err)
} }
settled = true
rl.PromptTokens, rl.CachedTokens, rl.CompletionTokens =
res.usage.PromptTokens, res.usage.CachedTokens, res.usage.CompletionTokens
rl.CacheHit = res.usage.CachedTokens > 0
rl.ProviderRequestID = res.providerID
text := resp.Text() text := res.text
if text == "" { if text == "" {
// Billed but no usable text (content filter / length cap / empty choices). Never // Billed but no usable text (content filter / length cap / empty choices). Never
// leave a billed request without feedback — react "couldn't answer". // leave a billed request without feedback — react "couldn't answer". The slot
b.log.Warn("xai returned empty completion (billed, reacting)", "sender", ev.Sender, "usd", usd) // stays consumed (the 2xx was real); no refund, or an empty reply could be forced
// to dodge the cap.
b.log.Warn("empty completion (billed, reacting)", "sender", ev.Sender, "usd", res.cost.Total())
rl.Degraded = degradeEmpty
b.react(ctx, roomID, ev.EventID, reactError) b.react(ctx, roomID, ev.EventID, reactError)
return return
} }
b.log.Info("answered", "room", roomID, "sender", ev.Sender, "dm", isDM, b.log.Info("answered", "room", roomID, "sender", ev.Sender, "dm", isDM, "route", res.route,
"usd", usd, "prompt_tokens", resp.Usage.PromptTokens, "completion_tokens", resp.Usage.CompletionTokens) "usd", res.cost.Total(), "prompt_tokens", res.usage.PromptTokens, "completion_tokens", res.usage.CompletionTokens)
b.sendReply(ctx, roomID, ev, mc, text) if err := b.sendReply(ctx, roomID, ev, mc, text); err != nil {
// Paid silence (§8.1): the spend is real (USD is kept — refunding it would
// under-count the ceiling), but the reply never landed. Refund the request SLOT
// so the user can retry, and react ⚠️ so the failure isn't silent.
b.log.Error("send reply failed after billing; refunding slot + reacting", "sender", ev.Sender, "err", err)
rl.Degraded, rl.Err = degradeSendFailed, err.Error()
if rerr := b.st.RefundRequest(ev.Sender); rerr != nil {
b.log.Error("refund failed", "sender", ev.Sender, "err", rerr)
}
b.react(ctx, roomID, ev.EventID, reactError)
return
}
rl.OK = true
} }
// computeUSD prices the call from the API-returned token usage (authoritative // maxPromptTokens bounds the assembled prompt (history is trimmed to fit) and feeds
// counts) and the configured per-1M prices — so the hard ceiling tracks real // the reservation estimate, so the two never disagree about a request's size.
// usage even if the model/price changes (only the constants need updating). const maxPromptTokens = 8000
func computeUSD(u xaiUsage, cfg *Config) float64 {
cached := u.PromptTokensDetails.CachedTokens // estimateUSD is the conservative max-cost reserved for a route before the call, so
nonCached := u.PromptTokens - cached // the global ceiling can count an in-flight request (§8.1). It prices a full prompt
// (maxPromptTokens) plus the max output at the model's non-cached rates — an upper-ish
// bound, since real calls send fewer tokens and get the cheaper cached rate. Settle
// later books the authoritative actual cost regardless, so a slightly-off estimate
// only nudges admission, never the final accounting.
func (b *Bot) estimateUSD(model string) float64 {
p := b.cfg.priceFor(model)
return float64(maxPromptTokens)/1e6*p.InputPerM + float64(b.cfg.MaxOutTok)/1e6*p.OutputPerM
}
// convID returns the prompt-cache routing hint sent as x-grok-conv-id, or "" when
// GROK_PROMPT_CACHE is off. Grok caches prompt prefixes automatically; the header
// only pins a conversation to the same backend to raise the hit rate (docs.x.ai), so
// a stable per-room id is the right unit — every turn in a room shares the system
// prompt and history prefix. It carries no PII (the room id is opaque) and is hashed
// to keep it compact and non-identifying.
func (b *Bot) convID(roomID string) string {
if !b.cfg.GrokPromptCache {
return ""
}
return fmt.Sprintf("vojo-%08x", hashString(roomID))
}
// computeUSD prices a call from the API-returned token usage (authoritative
// counts) and the per-model price table — so the hard ceiling tracks real usage
// even if the model/price changes (only the price table needs updating), and a
// call books at the price of the model that actually served it.
func computeUSD(model string, u Usage, cfg *Config) float64 {
p := cfg.priceFor(model)
nonCached := u.PromptTokens - u.CachedTokens
if nonCached < 0 { if nonCached < 0 {
nonCached = 0 nonCached = 0
} }
return float64(nonCached)/1e6*cfg.PriceInputPerM + return float64(nonCached)/1e6*p.InputPerM +
float64(cached)/1e6*cfg.PriceCachedPerM + float64(u.CachedTokens)/1e6*p.CachedPerM +
float64(u.CompletionTokens)/1e6*cfg.PriceOutputPerM float64(u.CompletionTokens)/1e6*p.OutputPerM
} }
// react adds an emoji m.reaction to the triggering event — the bot's language-free // react adds an emoji m.reaction to the triggering event — the bot's language-free
@@ -415,27 +584,31 @@ func (b *Bot) react(ctx context.Context, roomID, eventID, emoji string) {
// default, so this is a near-dead safety path; the reaction is far less intrusive // default, so this is a near-dead safety path; the reaction is far less intrusive
// than the old text notice, but the once-gate keeps it from annotating every message // than the old text notice, but the once-gate keeps it from annotating every message
// in the rare encrypted room. // in the rare encrypted room.
func (b *Bot) reactEncryptedOnce(ctx context.Context, roomID, eventID string) { // reactEncryptedOnce returns whether it reacted this call (true only the first time
// for a room), so the caller can log the skip exactly once too.
func (b *Bot) reactEncryptedOnce(ctx context.Context, roomID, eventID string) bool {
warned, err := b.st.HasWarnedEncrypted(roomID) warned, err := b.st.HasWarnedEncrypted(roomID)
if err != nil { if err != nil {
b.log.Error("warned-flag read failed", "room", roomID, "err", err) b.log.Error("warned-flag read failed", "room", roomID, "err", err)
return return false
} }
if warned { if warned {
return return false
} }
b.react(ctx, roomID, eventID, reactEncrypted) b.react(ctx, roomID, eventID, reactEncrypted)
if err := b.st.SetWarnedEncrypted(roomID); err != nil { if err := b.st.SetWarnedEncrypted(roomID); err != nil {
b.log.Error("persist warned-flag failed", "room", roomID, "err", err) b.log.Error("persist warned-flag failed", "room", roomID, "err", err)
} }
return true
} }
// sendReply sends the model's actual answer and records the completed exchange in the // sendReply sends the model's actual answer and records the completed exchange in the
// conversation buffer so the next turn has context. // conversation buffer so the next turn has context. It RETURNS the send error so the
func (b *Bot) sendReply(ctx context.Context, roomID string, trigger *Event, triggerMC *MessageContent, body string) { // caller can handle paid silence (§8.1): a billed answer that failed to deliver must
id := b.sendMessage(ctx, roomID, trigger, triggerMC, body) // refund the slot and react, not vanish.
if id == "" { func (b *Bot) sendReply(ctx context.Context, roomID string, trigger *Event, triggerMC *MessageContent, body string) error {
return if err := b.sendMessage(ctx, roomID, trigger, triggerMC, body); err != nil {
return err
} }
// Record the user trigger AND the assistant answer together, only AFTER the answer // Record the user trigger AND the assistant answer together, only AFTER the answer
// was sent, so a failed or empty generation never leaves a dangling user turn (a // was sent, so a failed or empty generation never leaves a dangling user turn (a
@@ -443,20 +616,21 @@ func (b *Bot) sendReply(ctx context.Context, roomID string, trigger *Event, trig
// Single-flight guarantees no other turn for this room interleaves between the two. // Single-flight guarantees no other turn for this room interleaves between the two.
b.appendBuf(roomID, bufferedMsg{sender: trigger.Sender, body: triggerMC.Body, isBot: false}) b.appendBuf(roomID, bufferedMsg{sender: trigger.Sender, body: triggerMC.Body, isBot: false})
b.appendBuf(roomID, bufferedMsg{sender: b.cfg.BotMXID, body: body, isBot: true}) b.appendBuf(roomID, bufferedMsg{sender: b.cfg.BotMXID, body: body, isBot: true})
return nil
} }
// sendMessage builds and sends an m.notice reply, tracks our own event id, and returns // sendMessage builds and sends an m.notice reply and tracks our own event id. Returns
// the new event id ("" on failure). // the send error (nil on success) so the caller can detect a failed delivery.
func (b *Bot) sendMessage(ctx context.Context, roomID string, trigger *Event, triggerMC *MessageContent, body string) string { func (b *Bot) sendMessage(ctx context.Context, roomID string, trigger *Event, triggerMC *MessageContent, body string) error {
content := buildNoticeContent(trigger.EventID, trigger.Sender, triggerMC.RelatesTo, body) content := buildNoticeContent(trigger.EventID, trigger.Sender, triggerMC.RelatesTo, body)
id, err := b.mx.SendEvent(ctx, roomID, "m.room.message", content) id, err := b.mx.SendEvent(ctx, roomID, "m.room.message", content)
if err != nil { if err != nil {
b.log.Error("send failed", "room", roomID, "err", err) b.log.Error("send failed", "room", roomID, "err", err)
return "" return err
} }
// Track our own reply so a future reply-to-it is recognised as addressing us. // Track our own reply so a future reply-to-it is recognised as addressing us.
b.botSent.Add(id) b.botSent.Add(id)
return id return nil
} }
// startTypingKeepalive starts the typing indicator and keeps it alive for the whole // startTypingKeepalive starts the typing indicator and keeps it alive for the whole
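
The `settled` flag plus deferred release used in `respond` can be shown in isolation. This is a simplified sketch, not the bot's code: the `ledger` type and `handle` function are hypothetical, and unlike the diff above (where the panic is recovered elsewhere, by safego) this sketch recovers in the same deferred function so it can run standalone.

```go
package main

import "fmt"

// ledger is a hypothetical, single-goroutine stand-in for the spend store.
type ledger struct {
	reserved float64
	spent    float64
}

func (l *ledger) release(estimate float64) { l.reserved -= estimate }

func (l *ledger) settle(estimate, actual float64) {
	l.reserved -= estimate
	l.spent += actual
}

// handle reserves the estimate, runs gen, and guarantees the reservation is
// freed on every exit path, including a panic inside gen.
func handle(l *ledger, estimate float64, gen func() float64) (err error) {
	l.reserved += estimate
	settled := false
	defer func() {
		if r := recover(); r != nil {
			err = fmt.Errorf("generation panicked: %v", r)
		}
		if !settled {
			// Nothing was booked, so hand the whole estimate back.
			l.release(estimate)
		}
	}()
	cost := gen()
	l.settle(estimate, cost)
	settled = true
	return nil
}

func main() {
	l := &ledger{}
	err := handle(l, 0.5, func() float64 { panic("provider client crashed") })
	fmt.Println("after panic:", err, "reserved:", l.reserved, "spent:", l.spent)

	err = handle(l, 0.5, func() float64 { return 0.125 })
	fmt.Println("after success:", err, "reserved:", l.reserved, "spent:", l.spent)
}
```

The flag makes the deferred cleanup a no-op on the normal path, so the estimate is never released twice.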


@@ -71,17 +71,22 @@ func TestStripReplyFallback(t *testing.T) {
 }
 func TestComputeUSD(t *testing.T) {
-cfg := &Config{PriceInputPerM: 1.25, PriceCachedPerM: 0.20, PriceOutputPerM: 2.50}
-var u xaiUsage
-u.PromptTokens = 1_000_000
-u.PromptTokensDetails.CachedTokens = 400_000
-u.CompletionTokens = 1_000_000
+const model = "grok-test"
+cfg := &Config{XAIModel: model, Prices: map[string]ModelPrice{
+model: {InputPerM: 1.25, CachedPerM: 0.20, OutputPerM: 2.50},
+}}
+u := Usage{PromptTokens: 1_000_000, CachedTokens: 400_000, CompletionTokens: 1_000_000}
 // nonCached 600k*1.25 + cached 400k*0.20 + out 1M*2.50 = 0.75 + 0.08 + 2.50
-got := computeUSD(u, cfg)
+got := computeUSD(model, u, cfg)
 want := 0.75 + 0.08 + 2.50
 if diff := got - want; diff > 1e-9 || diff < -1e-9 {
 t.Fatalf("computeUSD = %v, want %v", got, want)
 }
+// An unknown model falls back to the default model's price (never $0, which would
+// blind the ceiling).
+if got := computeUSD("unknown-model", u, cfg); got != want {
+t.Fatalf("unknown-model fallback = %v, want default %v", got, want)
+}
 }
 func TestBuildContextGroupDropsThirdParties(t *testing.T) {
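
The arithmetic in the TestComputeUSD comment can be reproduced with a small standalone sketch. This is not the repository's pricing.go: the `priceFor` and `computeUSD` signatures below are assumptions made for illustration (the real ones take a `*Config`), and only the field names and the fall-back-to-default behaviour are taken from the diff.

```go
package main

import "fmt"

// ModelPrice holds USD-per-1M-token rates (field names as used in the diff).
type ModelPrice struct{ InputPerM, CachedPerM, OutputPerM float64 }

// Usage mirrors the token counts the test constructs.
type Usage struct{ PromptTokens, CachedTokens, CompletionTokens int }

// priceFor is a hypothetical lookup: an unknown model falls back to the
// default model's entry, so a call is never booked at $0.
func priceFor(prices map[string]ModelPrice, defaultModel, model string) ModelPrice {
	if p, ok := prices[model]; ok {
		return p
	}
	return prices[defaultModel]
}

// computeUSD bills non-cached prompt tokens, cached prompt tokens and
// completion tokens at their own per-million rates.
func computeUSD(prices map[string]ModelPrice, defaultModel, model string, u Usage) float64 {
	p := priceFor(prices, defaultModel, model)
	nonCached := u.PromptTokens - u.CachedTokens
	if nonCached < 0 {
		nonCached = 0 // guard against a provider reporting more cached than prompt tokens
	}
	const perM = 1_000_000.0
	return float64(nonCached)/perM*p.InputPerM +
		float64(u.CachedTokens)/perM*p.CachedPerM +
		float64(u.CompletionTokens)/perM*p.OutputPerM
}

func main() {
	prices := map[string]ModelPrice{
		"grok-test": {InputPerM: 1.25, CachedPerM: 0.20, OutputPerM: 2.50},
	}
	u := Usage{PromptTokens: 1_000_000, CachedTokens: 400_000, CompletionTokens: 1_000_000}
	// 600k*1.25 + 400k*0.20 + 1M*2.50 per million = 0.75 + 0.08 + 2.50
	fmt.Printf("%.2f\n", computeUSD(prices, "grok-test", "grok-test", u))
	// An unknown model is priced as the default model.
	fmt.Printf("%.2f\n", computeUSD(prices, "grok-test", "unknown-model", u))
}
```

Both calls print 3.33. Falling back to the default price rather than zero means the spend ceiling is not blinded by a mistyped or newly introduced model name.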

apps/ai-bot/cascade.go Normal file

@ -0,0 +1,279 @@
package main
import (
"context"
"errors"
"fmt"
"strings"
"time"
)
// cascade.go is the generation half of the bot: given an admitted request, it routes
// (router.go), runs the chosen route's provider(s), and ALWAYS degrades to grok_direct
// on any layer being off or failing (§8.2). It returns a genResult the business logic
// (respond) settles, sends, and logs — keeping ledger/never-silent/telemetry in one
// place and the routing here. With every cascade flag off, classify returns grok_direct
// and this collapses to exactly today's single Grok call.
// genResult is everything respond needs from a generation: the answer, the model's
// usage (for token billing), the FULL cost breakdown (router + web + final), and the
// routing metadata for telemetry. cost accumulates across stages, so a partial cascade
// (a paid web fetch that then degraded) still books what it actually spent.
type genResult struct {
text string
usage Usage
cost CostBreakdown
finalModel string
providerID string
decision RouterDecision
route string // the route actually taken (may differ from decision on degrade)
fallback bool // true if we degraded off the decided route
degraded string // degrade reason for request_log
stageMS map[string]int
}
func msSince(t time.Time) int { return int(time.Since(t).Milliseconds()) }
// reserveEstimate is the admission envelope: the most expensive ENABLED route's cost,
// so whichever route the router picks is covered by the reservation (the ceiling can't
// be slipped by routing to a pricier path after admission). With every cascade flag
// off it equals grok_direct's estimate — byte-for-byte today's reservation. Slightly
// generous is fine: Settle books the authoritative actual afterward.
func (b *Bot) reserveEstimate() float64 {
est := b.estimateUSD(b.cfg.XAIModel) // grok_direct / trivial(cheaper)/synthesis base
if b.cfg.WebEnabled {
// web_then_grok = a web fetch (its own model call, plus any per-search fees) on top
// of the Grok synthesis already counted above.
if b.cfg.WebProvider == webProviderGrokWebSearch {
// fetch can search several times and pull large context; reserve generously.
est += float64(maxWebSearchCalls)*grokWebSearchPerCall + b.estimateUSD(b.cfg.XAIModel)
} else {
est += b.estimateUSD(b.cfg.GeminiModel)
}
}
if b.cfg.ReasoningEnabled {
// Higher reasoning effort can burn more output tokens; reserve double.
est = max(est, 2*b.estimateUSD(b.cfg.ReasoningModel))
}
return est
}
// generate routes and produces an answer, degrading to grok_direct on any failure.
// It returns a terminal error ONLY if even grok_direct fails; every other route falls
// through to grok_direct rather than erroring.
func (b *Bot) generate(ctx context.Context, body string, msgs []Message, convID string) (genResult, error) {
res := genResult{stageMS: map[string]int{}, finalModel: b.cfg.XAIModel}
t0 := time.Now()
res.decision = b.classify(ctx, body, &res.cost) // accumulates cost.Router if Layer-1 runs
res.stageMS["router"] = msSince(t0)
res.route = res.decision.Route
finalMsgs := msgs
switch res.decision.Route {
case routeTrivial:
if b.cfg.TrivialOffloadEnabled && b.gemini != nil {
if err := b.genTrivial(ctx, msgs, &res); err == nil {
return res, nil
} else {
b.log.Warn("trivial offload failed; degrading to grok_direct", "err", err)
b.degradeTo(&res, degradeTrivial)
}
}
case routeWebThenGrok:
if b.cfg.WebEnabled && b.web != nil {
if err := b.genWebThenGrok(ctx, body, msgs, convID, &res); err == nil {
return res, nil
} else {
b.log.Warn("web route failed; degrading to grok_direct", "err", err, "reason", res.degraded)
b.degradeTo(&res, degradeWeb)
// The question wanted fresh facts but we have none — answer from training
// knowledge WITH an honest staleness caveat, not stale-as-current (§8.2.1).
finalMsgs = hedgeMessages(msgs)
}
}
case routeReason:
if b.cfg.ReasoningEnabled {
if err := b.genReason(ctx, msgs, convID, &res); err == nil {
return res, nil
} else {
b.log.Warn("reasoning route failed; degrading to grok_direct", "err", err)
b.degradeTo(&res, degradeReasoning)
}
}
}
// grok_direct — the default route AND the universal fallback. The only path that
// can return a terminal error (even Grok failed). It preserves any cost already
// spent (router classifier, a partial web fetch) in res.cost.
if err := b.genGrokDirect(ctx, finalMsgs, convID, &res); err != nil {
return res, err
}
return res, nil
}
// degradeTo marks res as a fallback to grok_direct, keeping the first/most-specific
// degrade reason (e.g. a web provider's grounding_cap set inside genWebThenGrok).
func (b *Bot) degradeTo(res *genResult, reason string) {
res.fallback = true
if res.degraded == "" {
res.degraded = reason
}
}
// genGrokDirect is today's path: one Grok call. Also the fallback for every other
// route. On success it fills res (route, final model, text, usage, provider id) and
// adds the token cost.
func (b *Bot) genGrokDirect(ctx context.Context, msgs []Message, convID string, res *genResult) error {
t := time.Now()
resp, err := b.llm.Complete(ctx, LLMRequest{
Model: b.cfg.XAIModel,
Messages: msgs,
MaxTokens: b.cfg.MaxOutTok,
Temperature: b.cfg.XAITemp,
ConvID: convID,
ReasoningEffort: b.cfg.GrokReasoningEffort, // "" → not sent; "none" keeps grok-4.3 fast
})
res.stageMS["final"] = msSince(t)
if err != nil {
return err
}
res.route, res.finalModel = routeGrokDirect, b.cfg.XAIModel
res.text, res.usage, res.providerID = resp.Text, resp.Usage, resp.ProviderRequestID
res.cost.Token += computeUSD(b.cfg.XAIModel, resp.Usage, b.cfg)
return nil
}
// genTrivial answers a trivial message with the cheap Gemini model. An empty reply is
// treated as a failure so the caller degrades to Grok rather than sending nothing.
func (b *Bot) genTrivial(ctx context.Context, msgs []Message, res *genResult) error {
t := time.Now()
resp, err := b.gemini.Complete(ctx, LLMRequest{
Model: b.cfg.GeminiModel,
Messages: msgs,
MaxTokens: b.cfg.MaxOutTok,
Temperature: b.cfg.XAITemp,
})
res.stageMS["final"] = msSince(t)
if err != nil {
return err
}
if strings.TrimSpace(resp.Text) == "" {
return fmt.Errorf("trivial: empty Gemini reply")
}
res.route, res.finalModel = routeTrivial, b.cfg.GeminiModel
res.text, res.usage, res.providerID = resp.Text, resp.Usage, resp.ProviderRequestID
res.cost.Token += computeUSD(b.cfg.GeminiModel, resp.Usage, b.cfg)
return nil
}
// genReason answers with Grok at a higher reasoning effort. Uses the configured
// reasoning-capable model (the default grok-4.20-non-reasoning would reject the param).
func (b *Bot) genReason(ctx context.Context, msgs []Message, convID string, res *genResult) error {
t := time.Now()
resp, err := b.llm.Complete(ctx, LLMRequest{
Model: b.cfg.ReasoningModel,
Messages: msgs,
MaxTokens: b.cfg.MaxOutTok,
Temperature: b.cfg.XAITemp,
ReasoningEffort: b.cfg.ReasoningEffort, // "think harder" level (default high)
ConvID: convID,
})
res.stageMS["final"] = msSince(t)
if err != nil {
return err
}
if strings.TrimSpace(resp.Text) == "" {
return fmt.Errorf("reason: empty reply")
}
res.route, res.finalModel = routeReason, b.cfg.ReasoningModel
res.text, res.usage, res.providerID = resp.Text, resp.Usage, resp.ProviderRequestID
res.cost.Token += computeUSD(b.cfg.ReasoningModel, resp.Usage, b.cfg)
return nil
}
// webStageTimeout bounds the web/grounding fetch independently of the overall budget
// (§8.2.2): a slow search must not eat the whole request before synthesis.
const webStageTimeout = 15 * time.Second
// genWebThenGrok fetches fresh facts via the web provider, then has Grok synthesise the
// answer in voice from that digest. The web fetch's cost+tokens are booked into res
// EVEN ON FAILURE — the call was billed — so a synth failure or empty fetch still
// accounts for the spend before the caller degrades to grok_direct (the partial cascade
// case, §8.1). The daily cap and per-stage deadline are applied here, uniformly for both
// providers.
func (b *Bot) genWebThenGrok(ctx context.Context, body string, msgs []Message, convID string, res *genResult) error {
// Per-stage web/grounding deadline, independent of the overall budget.
wctx, cancelW := context.WithTimeout(ctx, webStageTimeout)
tw := time.Now()
wc, ferr := b.web.Fetch(wctx, body)
cancelW()
res.stageMS["web"] = msSince(tw)
// Book the fetch's fee + tokens whether or not it produced a usable digest — the call
// was billed (the daily cap, if any, is enforced inside the provider).
res.cost.Grounding += wc.Cost.Grounding
res.cost.WebTool += wc.Cost.WebTool
webUsage := wc.Usage
if ferr != nil {
if errors.Is(ferr, errGroundingCapped) {
res.degraded = degradeGroundCap
}
return ferr // web fee already booked; caller degrades to grok_direct (with hedge)
}
tf := time.Now()
resp, err := b.llm.Complete(ctx, LLMRequest{
Model: b.cfg.XAIModel,
Messages: webSynthMessages(msgs, wc),
MaxTokens: b.cfg.MaxOutTok,
Temperature: b.cfg.XAITemp,
ConvID: convID,
ReasoningEffort: b.cfg.GrokReasoningEffort, // same voice, same effort as grok_direct
})
res.stageMS["final"] = msSince(tf)
if err != nil {
return err
}
if strings.TrimSpace(resp.Text) == "" {
return fmt.Errorf("web synth: empty reply")
}
res.route, res.finalModel = routeWebThenGrok, b.cfg.XAIModel
res.text, res.providerID = resp.Text, resp.ProviderRequestID
// Report BOTH calls' tokens so the analytics token totals match the two-call route.
res.usage = Usage{
PromptTokens: resp.Usage.PromptTokens + webUsage.PromptTokens,
CachedTokens: resp.Usage.CachedTokens + webUsage.CachedTokens,
CompletionTokens: resp.Usage.CompletionTokens + webUsage.CompletionTokens,
}
res.cost.Token += computeUSD(b.cfg.XAIModel, resp.Usage, b.cfg)
return nil
}
// webSynthMessages inserts the fresh web digest (and its sources) as a system note just
// after the system prompt, so Grok answers in voice using current facts.
func webSynthMessages(base []Message, wc WebContext) []Message {
facts := "Свежие данные из веба (используй их в ответе и сошлись на источники):\n" + wc.Digest
if len(wc.Citations) > 0 {
facts += "\nИсточники: " + strings.Join(wc.Citations, ", ")
}
return insertSystemNote(base, facts)
}
// hedgeMessages adds an honest staleness caveat for a web→grok_direct degrade: the user
// wanted fresh facts but we couldn't fetch them, so the model must flag that its answer
// is from training knowledge and may be out of date.
func hedgeMessages(base []Message) []Message {
return insertSystemNote(base, "Нет доступа к свежим источникам прямо сейчас — отвечай по знаниям на момент обучения и честно предупреди, что данные могут быть устаревшими.")
}
// insertSystemNote inserts an extra system message right after the system prompt
// (base[0] from buildContext), preserving the rest of the window.
func insertSystemNote(base []Message, content string) []Message {
note := Message{Role: "system", Content: content}
if len(base) == 0 {
return []Message{note}
}
out := make([]Message, 0, len(base)+1)
out = append(out, base[0], note)
out = append(out, base[1:]...)
return out
}
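
The degrade rule that generate follows (try the decided route, fall back to the default, keep whatever was already spent, keep the first degrade reason) can be sketched in isolation. Everything below is a simplified stand-in, not the code above: `result`, `stage`, `runWithFallback` and the `"web_failed"` reason string are hypothetical names chosen for the illustration.

```go
package main

import (
	"errors"
	"fmt"
)

// errStageFailed is a placeholder error for the demo stages.
var errStageFailed = errors.New("stage failed")

// result is a trimmed-down stand-in for genResult.
type result struct {
	text     string
	route    string
	fallback bool
	degraded string
	costUSD  float64
}

// stage is a hypothetical route runner; it may add cost to res even when it fails.
type stage func(res *result) error

// runWithFallback tries the preferred stage, then the fallback. Only a fallback
// failure reaches the caller, and cost booked by a failed stage is kept.
func runWithFallback(preferred stage, reason string, fallback stage) (result, error) {
	var res result
	if preferred != nil {
		err := preferred(&res)
		if err == nil {
			return res, nil
		}
		res.fallback = true
		if res.degraded == "" { // keep the first, most specific reason
			res.degraded = reason
		}
	}
	if err := fallback(&res); err != nil {
		return res, err
	}
	return res, nil
}

func main() {
	web := func(res *result) error {
		res.costUSD += 0.10 // the fetch was billed before the stage failed
		return errStageFailed
	}
	direct := func(res *result) error {
		res.route, res.text = "grok_direct", "answer"
		res.costUSD += 0.02
		return nil
	}
	res, err := runWithFallback(web, "web_failed", direct)
	fmt.Printf("%s %v %s %.2f %v\n", res.route, res.fallback, res.degraded, res.costUSD, err)
}
```

Passing a single mutable result through every stage is what lets a partial cascade report its full spend: the failed stage's 0.10 and the fallback's 0.02 both end up in the same total.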

apps/ai-bot/cascade_test.go Normal file

@ -0,0 +1,275 @@
package main
import (
"context"
"errors"
"io"
"log/slog"
"testing"
)
func discardLog() *slog.Logger { return slog.New(slog.NewTextHandler(io.Discard, nil)) }
// fakeLLM is a scriptable LLMClient for dispatch/degrade tests.
type fakeLLM struct {
text string
usage Usage
err error
calls int
lastReq LLMRequest
}
func (f *fakeLLM) Complete(_ context.Context, req LLMRequest) (*LLMResponse, error) {
f.calls++
f.lastReq = req
if f.err != nil {
return nil, f.err
}
return &LLMResponse{Text: f.text, Usage: f.usage, ProviderRequestID: "fake"}, nil
}
type fakeWeb struct {
wc WebContext
err error
calls int
}
func (f *fakeWeb) Fetch(_ context.Context, _ string) (WebContext, error) {
f.calls++
if f.err != nil {
return WebContext{}, f.err
}
return f.wc, nil
}
// cascadeCfg is a config with the model/price table set and EVERY cascade flag off.
// Tests flip individual flags on a copy.
func cascadeCfg() Config {
return Config{
XAIModel: "grok-x", GeminiModel: "gemini-x", ReasoningModel: "grok-reason",
MaxOutTok: 100, XAITemp: 0.5,
ReasoningTrigger: "подумай глубже",
ReasoningEffort: "high",
WebProvider: webProviderGrokWebSearch,
Prices: map[string]ModelPrice{
"grok-x": {InputPerM: 1, CachedPerM: 0.2, OutputPerM: 2},
"gemini-x": {InputPerM: 0.1, CachedPerM: 0.1, OutputPerM: 0.4},
},
}
}
func msgs(body string) []Message {
return []Message{{Role: "system", Content: "SYS"}, {Role: "user", Content: body}}
}
// TestGenerateAllFlagsOffIsGrokDirect is the cascade-off parity invariant: even a
// "trivial"-looking message goes to Grok, and Gemini is never touched, when the router
// is off.
func TestGenerateAllFlagsOffIsGrokDirect(t *testing.T) {
grok := &fakeLLM{text: "grok answer"}
gem := &fakeLLM{text: "should not run"}
cfg := cascadeCfg()
b := &Bot{cfg: &cfg, llm: grok, gemini: gem, log: discardLog()}
res, err := b.generate(context.Background(), "привет", msgs("привет"), "")
if err != nil {
t.Fatalf("generate: %v", err)
}
if res.route != routeGrokDirect || res.text != "grok answer" {
t.Fatalf("res = (%q,%q), want grok_direct/\"grok answer\"", res.route, res.text)
}
if res.decision.Source != "default" {
t.Fatalf("router source = %q, want default (router off)", res.decision.Source)
}
if grok.calls != 1 || gem.calls != 0 {
t.Fatalf("calls grok=%d gem=%d, want 1/0", grok.calls, gem.calls)
}
}
func TestGenerateTrivialOffload(t *testing.T) {
grok := &fakeLLM{text: "grok"}
gem := &fakeLLM{text: "gemini trivial"}
cfg := cascadeCfg()
cfg.RouterEnabled, cfg.TrivialOffloadEnabled = true, true
b := &Bot{cfg: &cfg, llm: grok, gemini: gem, log: discardLog()}
res, err := b.generate(context.Background(), "привет", msgs("привет"), "")
if err != nil {
t.Fatalf("generate: %v", err)
}
if res.route != routeTrivial || res.text != "gemini trivial" || res.finalModel != "gemini-x" {
t.Fatalf("res = (%q,%q,%q), want trivial/gemini", res.route, res.text, res.finalModel)
}
if gem.calls != 1 || grok.calls != 0 {
t.Fatalf("calls grok=%d gem=%d, want 0/1 (Gemini answered)", grok.calls, gem.calls)
}
}
// TestGenerateTrivialDegradesToGrok: Gemini failing on the trivial route must fall back
// to Grok, never go silent.
func TestGenerateTrivialDegradesToGrok(t *testing.T) {
grok := &fakeLLM{text: "grok fallback"}
gem := &fakeLLM{err: errors.New("gemini down")}
cfg := cascadeCfg()
cfg.RouterEnabled, cfg.TrivialOffloadEnabled = true, true
b := &Bot{cfg: &cfg, llm: grok, gemini: gem, log: discardLog()}
res, err := b.generate(context.Background(), "привет", msgs("привет"), "")
if err != nil {
t.Fatalf("generate: %v", err)
}
if res.route != routeGrokDirect || res.text != "grok fallback" {
t.Fatalf("res = (%q,%q), want grok_direct fallback", res.route, res.text)
}
if !res.fallback || res.degraded != degradeTrivial {
t.Fatalf("fallback=%v degraded=%q, want true/trivial_failed", res.fallback, res.degraded)
}
if gem.calls != 1 || grok.calls != 1 {
t.Fatalf("calls grok=%d gem=%d, want 1/1", grok.calls, gem.calls)
}
}
func TestGenerateWebThenGrok(t *testing.T) {
grok := &fakeLLM{text: "synthesised", usage: Usage{PromptTokens: 100, CompletionTokens: 50}}
web := &fakeWeb{wc: WebContext{Digest: "fresh facts", Citations: []string{"http://src"}, Cost: CostBreakdown{WebTool: 0.1}}}
cfg := cascadeCfg()
cfg.RouterEnabled, cfg.WebEnabled = true, true
b := &Bot{cfg: &cfg, llm: grok, web: web, log: discardLog()}
res, err := b.generate(context.Background(), "какие новости сегодня", msgs("какие новости сегодня"), "")
if err != nil {
t.Fatalf("generate: %v", err)
}
if res.route != routeWebThenGrok || res.text != "synthesised" {
t.Fatalf("res = (%q,%q), want web_then_grok/synthesised", res.route, res.text)
}
if res.cost.WebTool != 0.1 || res.cost.Token <= 0 {
t.Fatalf("cost = %+v, want WebTool 0.1 + Token>0", res.cost)
}
if web.calls != 1 || grok.calls != 1 {
t.Fatalf("calls web=%d grok=%d, want 1/1", web.calls, grok.calls)
}
}
// TestGenerateWebDegradesToGrok: a web fetch failure (provider down or cap hit) degrades
// to grok_direct. The fake fetch fails before billing, so no web cost is booked.
func TestGenerateWebDegradesToGrok(t *testing.T) {
grok := &fakeLLM{text: "grok fallback"}
web := &fakeWeb{err: errGroundingCapped}
cfg := cascadeCfg()
cfg.RouterEnabled, cfg.WebEnabled = true, true
b := &Bot{cfg: &cfg, llm: grok, web: web, log: discardLog()}
res, err := b.generate(context.Background(), "новости сегодня", msgs("новости сегодня"), "")
if err != nil {
t.Fatalf("generate: %v", err)
}
if res.route != routeGrokDirect || res.text != "grok fallback" || !res.fallback {
t.Fatalf("res = (%q,%q,fallback=%v), want grok_direct fallback", res.route, res.text, res.fallback)
}
if res.degraded != degradeGroundCap {
t.Fatalf("degraded = %q, want grounding_cap (the specific reason)", res.degraded)
}
if res.cost.WebTool != 0 || res.cost.Grounding != 0 {
t.Fatalf("web cost = %+v, want 0 (fetch failed before billing)", res.cost)
}
}
// TestGenerateReasoningForced: the manual trigger routes to the reasoning model with
// reasoning_effort, independent of ROUTER_ENABLED.
func TestGenerateReasoningForced(t *testing.T) {
grok := &fakeLLM{text: "deep answer"}
cfg := cascadeCfg()
cfg.ReasoningEnabled = true // ROUTER_ENABLED deliberately left off
b := &Bot{cfg: &cfg, llm: grok, log: discardLog()}
res, err := b.generate(context.Background(), "подумай глубже про сознание", msgs("подумай глубже про сознание"), "")
if err != nil {
t.Fatalf("generate: %v", err)
}
if res.route != routeReason || res.decision.Source != "forced" {
t.Fatalf("res route=%q source=%q, want reason/forced", res.route, res.decision.Source)
}
if grok.lastReq.ReasoningEffort != "high" || grok.lastReq.Model != "grok-reason" {
t.Fatalf("reasoning req = (effort %q, model %q), want high/grok-reason", grok.lastReq.ReasoningEffort, grok.lastReq.Model)
}
}
// TestClassifierConfidenceFloor: a Layer-1 classifier label that escalates off the safe
// floor (trivial/web) must clear the confidence floor, else the request stays on
// grok_direct — the false-trivial voice-leak guard (§8.6).
func TestClassifierConfidenceFloor(t *testing.T) {
cfg := cascadeCfg()
cfg.RouterEnabled, cfg.RouterClassifierEnabled = true, true
gem := &fakeLLM{}
b := &Bot{cfg: &cfg, gemini: gem, log: discardLog()}
var cost CostBreakdown
const substantive = "напиши подробное эссе про историю римской империи" // Layer-0 → grok_direct
gem.text = `{"route":"trivial","confidence":0.2}` // low-confidence escalation
if d := b.classify(context.Background(), substantive, &cost); d.Route != routeGrokDirect {
t.Fatalf("low-confidence trivial must stay grok_direct (safe floor), got %q", d.Route)
}
gem.text = `{"route":"trivial","confidence":0.95}` // confident escalation is honoured
if d := b.classify(context.Background(), substantive, &cost); d.Route != routeTrivial {
t.Fatalf("high-confidence trivial should route trivial, got %q", d.Route)
}
// A classifier error degrades to the Layer-0 verdict (grok_direct), never silence.
gem.text, gem.err = "", errors.New("gemini down")
if d := b.classify(context.Background(), substantive, &cost); d.Route != routeGrokDirect {
t.Fatalf("classifier failure must fall back to heuristic grok_direct, got %q", d.Route)
}
}
// TestGrokReasoningEffort: GROK_REASONING_EFFORT is sent on grok_direct (so grok-4.3 can
// be kept fast with "none"), empty means not sent (compat with grok-4.20-non-reasoning),
// and the reason route ignores it, sending REASONING_EFFORT instead ("high" in cascadeCfg).
func TestGrokReasoningEffort(t *testing.T) {
// Configured effort reaches grok_direct.
grok := &fakeLLM{text: "ok"}
cfg := cascadeCfg()
cfg.GrokReasoningEffort = "none"
b := &Bot{cfg: &cfg, llm: grok, log: discardLog()}
if _, err := b.generate(context.Background(), "hello", msgs("hello"), ""); err != nil {
t.Fatal(err)
}
if grok.lastReq.ReasoningEffort != "none" {
t.Fatalf("grok_direct effort = %q, want none", grok.lastReq.ReasoningEffort)
}
// Empty default → not sent (so grok-4.20-non-reasoning keeps working).
grokDef := &fakeLLM{text: "ok"}
cfgDef := cascadeCfg() // GrokReasoningEffort == ""
bDef := &Bot{cfg: &cfgDef, llm: grokDef, log: discardLog()}
if _, err := bDef.generate(context.Background(), "hello", msgs("hello"), ""); err != nil {
t.Fatal(err)
}
if grokDef.lastReq.ReasoningEffort != "" {
t.Fatalf("default effort = %q, want empty (not sent)", grokDef.lastReq.ReasoningEffort)
}
// The reason route ignores GROK_REASONING_EFFORT and sends REASONING_EFFORT ("high" in cascadeCfg).
grokR := &fakeLLM{text: "deep"}
cfgR := cascadeCfg()
cfgR.GrokReasoningEffort = "none"
cfgR.ReasoningEnabled = true
bR := &Bot{cfg: &cfgR, llm: grokR, log: discardLog()}
if _, err := bR.generate(context.Background(), "подумай глубже про X", msgs("подумай глубже про X"), ""); err != nil {
t.Fatal(err)
}
if grokR.lastReq.ReasoningEffort != "high" {
t.Fatalf("reason route effort = %q, want high (overrides GROK_REASONING_EFFORT)", grokR.lastReq.ReasoningEffort)
}
}
// TestGenerateTerminalErrorPropagates: if even grok_direct fails, generate returns the
// error (respond turns it into refund + react), not a silent empty success.
func TestGenerateTerminalErrorPropagates(t *testing.T) {
grok := &fakeLLM{err: errors.New("xai down")}
cfg := cascadeCfg()
b := &Bot{cfg: &cfg, llm: grok, log: discardLog()}
if _, err := b.generate(context.Background(), "hello", msgs("hello"), ""); err == nil {
t.Fatal("want terminal error when grok_direct fails, got nil")
}
}
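
The behaviour TestClassifierConfidenceFloor pins down can be sketched without the router itself. The classifier lives in router.go, which is not shown here, so the snippet below is only an assumption-laden illustration: `applyFloor`, `verdict`, `errClassifierDown` and the 0.8 floor are hypothetical, and the `"web_then_grok"` string is inferred from the test messages.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
)

const (
	routeGrokDirect  = "grok_direct"
	routeTrivial     = "trivial"
	routeWebThenGrok = "web_then_grok"
	// confidenceFloor is an assumed value for illustration only.
	confidenceFloor = 0.8
)

// errClassifierDown is a placeholder for a failed classifier call.
var errClassifierDown = errors.New("classifier down")

// verdict is the JSON shape the test feeds the fake classifier.
type verdict struct {
	Route      string  `json:"route"`
	Confidence float64 `json:"confidence"`
}

// applyFloor returns the classifier's route only when the call succeeded, the
// reply parses, the route is known, and the confidence clears the floor.
// In every other case the heuristic verdict is kept.
func applyFloor(heuristic, raw string, callErr error) string {
	if callErr != nil {
		return heuristic
	}
	var v verdict
	if err := json.Unmarshal([]byte(raw), &v); err != nil {
		return heuristic
	}
	switch v.Route {
	case routeGrokDirect, routeTrivial, routeWebThenGrok:
	default:
		return heuristic // unknown label: stay on the safe route
	}
	if v.Confidence < confidenceFloor {
		return heuristic
	}
	return v.Route
}

func main() {
	fmt.Println(applyFloor(routeGrokDirect, `{"route":"trivial","confidence":0.2}`, nil))
	fmt.Println(applyFloor(routeGrokDirect, `{"route":"trivial","confidence":0.95}`, nil))
	fmt.Println(applyFloor(routeGrokDirect, "", errClassifierDown))
}
```

This prints grok_direct, trivial, grok_direct, matching the three cases the test asserts. Every failure mode resolves to the heuristic verdict, so a flaky or unsure classifier can only cost a cheaper route, never an answer.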

View file

@ -5,6 +5,7 @@ import (
"os" "os"
"strconv" "strconv"
"strings" "strings"
"time"
) )
// Config is the fully-resolved runtime configuration, parsed once from the // Config is the fully-resolved runtime configuration, parsed once from the
@ -36,23 +37,106 @@ type Config struct {
MaxOutTok int MaxOutTok int
MaxCtxEvent int MaxCtxEvent int
// GrokReasoningEffort is the reasoning_effort sent on the normal Grok voice calls
// (grok_direct + web synthesis). Empty = don't send it (the default — required for
// grok-4.20-non-reasoning, which rejects the param). On a unified model like
// grok-4.3 the API otherwise defaults to "low" (it thinks on every reply); set this
// to "none" to keep the default voice fast/cheap. The reason_then_grok route ignores
// this and sends ReasoningEffort instead (default "high"). Accepted: "" | none | low | medium | high.
GrokReasoningEffort string
// Allowlist of homeservers whose users may pull the bot into a room. Gates // Allowlist of homeservers whose users may pull the bot into a room. Gates
// the *inviter* (F11). Comma-separated env, stored as a set. // the *inviter* (F11). Comma-separated env, stored as a set.
AllowedServers map[string]bool AllowedServers map[string]bool
DailyUSDCeiling float64 DailyUSDCeiling float64
PerUserDailyCap int PerUserDailyCap int
// PerUserDailyUSD is an optional per-user daily $ quota (0 = off) on top of the
// request count cap, so one user on expensive routes can't drain the shared global
// ceiling and deny everyone else. Checked against the user's own committed+reserved
// spend in Reserve.
PerUserDailyUSD float64
// mxids exempt from PER_USER_DAILY_CAP (e.g. the owner/admins testing). Still // mxids exempt from PER_USER_DAILY_CAP (e.g. the owner/admins testing). Still
// subject to the global DAILY_USD_CEILING, so the wallet stays protected. // subject to the global DAILY_USD_CEILING, so the wallet stays protected.
UnlimitedUsers map[string]bool UnlimitedUsers map[string]bool
-// USD-per-1M-token prices applied to the API-returned token usage so the
-// hard ceiling tracks real usage even if the model/price changes.
+// USD-per-1M-token prices for the default (final-voice) model, applied to the
+// API-returned token usage so the hard ceiling tracks real usage even if the
+// model/price changes. Kept as the back-compat XAI_PRICE_* source; folded into
+// Prices below.
PriceInputPerM float64 PriceInputPerM float64
PriceCachedPerM float64 PriceCachedPerM float64
PriceOutputPerM float64 PriceOutputPerM float64
// Prices is the per-model price table (LiteLLM pattern) read by priceFor(model),
// so a call books at the price of the model that actually served it. Built in
// LoadConfig: the default model's entry comes from the XAI_PRICE_* envs, and the
// Gemini and reasoning models add their own entries from their *_PRICE_* envs.
Prices map[string]ModelPrice
// RequestBudget bounds one whole request (all model calls share it), so a slow or
// retried call — or a multi-stage cascade — can't accrete minutes. The default
// matches the previous effective ceiling for a single grok_direct call.
RequestBudget time.Duration
// GrokPromptCache, when true, sends the x-grok-conv-id routing header to raise the
// prompt-cache hit rate (Grok caches automatically; the header only pins routing).
GrokPromptCache bool
// TelemetryEnabled writes the request_log analytics row for every request. Default
// off so the cascade-off path adds no extra write; turned on to measure the base.
// Its write is isolated — a failure logs a WARN, never drops the answer.
TelemetryEnabled bool
// TelemetryStoreText additionally stores the query text in request_log (for offline
// eval). Default off — only metadata is kept.
TelemetryStoreText bool
// TelemetryRetention trims request_log rows older than this (time-based, since the
// analytics are a time series). 0 disables trimming.
TelemetryRetention time.Duration
// --- Cascade (Phase 2-4). EVERY flag defaults OFF, so an unset environment is
// exactly today's bot: one grok_direct call. Any layer off or failing degrades to
// grok_direct (§8.2). None of these is enabled in prod until the offline-eval gate
// (§9) passes. ---
// RouterEnabled turns on the Layer-0 heuristic router; off → everything is
// grok_direct. RouterClassifierEnabled additionally consults the Gemini Layer-1
// classifier on cases the heuristic left as grok_direct.
RouterEnabled bool
RouterClassifierEnabled bool
// TrivialOffloadEnabled lets the trivial route answer with Gemini; off → trivial
// still goes to Grok.
TrivialOffloadEnabled bool
// WebEnabled turns on the web_then_grok route. WebProvider selects the source:
// grok_web_search (default, works on chat/completions via Live Search) or
// gemini_grounding (Gemini-3 native only — see F-EXT-3).
WebEnabled bool
WebProvider string
// WebGroundingDailyCap caps grounded prompts/day (durable counter) before falling
// back, guarding the $/1k grounding overage. WebGroundingTier records the Gemini
// plan the cap reflects.
WebGroundingDailyCap int
WebGroundingTier string
// Reasoning route: a manual "think harder" trigger. ReasoningModel must be a
// reasoning-capable model (the default grok-4.20-non-reasoning is NOT; see the
// docs.x.ai finding). REASONING_MODEL defaults to grok-4.3.
ReasoningEnabled bool
ReasoningTrigger string
ReasoningModel string
// ReasoningEffort is the reasoning_effort the reason_then_grok route sends on the
// manual "think harder" trigger. Default "high". Accepted: none|low|medium|high.
ReasoningEffort string
// CanaryPercent routes a fraction of traffic through the new path for A/B before a
// full enable. 0 = off (scaffold; not yet consulted by the dispatch).
CanaryPercent int
// Gemini backend (the cheap/router/grounding model). Required only when a layer
// that uses it is enabled (validated below).
GeminiBaseURL string
GeminiAPIKey string
GeminiModel string
SystemPromptPath string SystemPromptPath string
SystemPrompt string SystemPrompt string
StateDir string StateDir string
@ -111,6 +195,23 @@ func getenvFloat(key string, def float64) (float64, error) {
return f, nil return f, nil
} }
// getenvBool parses a boolean flag. Accepts the usual 1/0/true/false/yes/no/on/off
// (case-insensitive); empty → default. Every cascade flag defaults false, so an unset
// or blank env keeps today's behaviour.
func getenvBool(key string, def bool) (bool, error) {
raw := strings.TrimSpace(getenv(key, ""))
if raw == "" {
return def, nil
}
switch strings.ToLower(raw) {
case "1", "true", "yes", "on":
return true, nil
case "0", "false", "no", "off":
return false, nil
}
return false, fmt.Errorf("%s must be a boolean (true/false), got %q", key, raw)
}
func parseServerSet(raw string) map[string]bool { func parseServerSet(raw string) map[string]bool {
set := make(map[string]bool) set := make(map[string]bool)
for _, s := range strings.Split(raw, ",") { for _, s := range strings.Split(raw, ",") {
@ -139,6 +240,16 @@ func LoadConfig() (*Config, error) {
DatabaseURL: getenv("AI_BOT_DATABASE_URL", ""), DatabaseURL: getenv("AI_BOT_DATABASE_URL", ""),
AllowedServers: parseServerSet(getenv("ALLOWED_SERVERS", "")), AllowedServers: parseServerSet(getenv("ALLOWED_SERVERS", "")),
UnlimitedUsers: parseServerSet(getenv("UNLIMITED_USERS", "")), UnlimitedUsers: parseServerSet(getenv("UNLIMITED_USERS", "")),
// Cascade string-valued config (flags/ints/secrets parsed below).
GrokReasoningEffort: strings.ToLower(strings.TrimSpace(getenv("GROK_REASONING_EFFORT", ""))),
WebProvider: getenv("WEB_PROVIDER", webProviderGrokWebSearch),
WebGroundingTier: getenv("WEB_GROUNDING_TIER", "free"),
ReasoningTrigger: getenv("REASONING_TRIGGER", "подумай глубже"),
ReasoningModel: getenv("REASONING_MODEL", "grok-4.3"),
ReasoningEffort: strings.ToLower(strings.TrimSpace(getenv("REASONING_EFFORT", "high"))),
GeminiBaseURL: strings.TrimRight(getenv("GEMINI_BASE_URL", "https://generativelanguage.googleapis.com/v1beta/openai"), "/"),
GeminiModel: getenv("GEMINI_MODEL", "gemini-2.5-flash-lite"),
} }
var problems []string var problems []string
@ -152,6 +263,7 @@ func LoadConfig() (*Config, error) {
{"AS_TOKEN", &cfg.ASToken}, {"AS_TOKEN", &cfg.ASToken},
{"HS_TOKEN", &cfg.HSToken}, {"HS_TOKEN", &cfg.HSToken},
{"XAI_API_KEY", &cfg.XAIAPIKey}, {"XAI_API_KEY", &cfg.XAIAPIKey},
{"GEMINI_API_KEY", &cfg.GeminiAPIKey}, // optional; required only if a Gemini layer is on
} { } {
v, err := getSecret(s.key) v, err := getSecret(s.key)
if err != nil { if err != nil {
@ -207,6 +319,9 @@ func LoadConfig() (*Config, error) {
if cfg.PerUserDailyCap, err = getenvInt("PER_USER_DAILY_CAP", 30); err != nil { if cfg.PerUserDailyCap, err = getenvInt("PER_USER_DAILY_CAP", 30); err != nil {
problems = append(problems, err.Error()) problems = append(problems, err.Error())
} }
if cfg.PerUserDailyUSD, err = getenvFloat("PER_USER_DAILY_USD", 0); err != nil {
problems = append(problems, err.Error())
}
if cfg.PriceInputPerM, err = getenvFloat("XAI_PRICE_INPUT_PER_M", 1.25); err != nil { if cfg.PriceInputPerM, err = getenvFloat("XAI_PRICE_INPUT_PER_M", 1.25); err != nil {
problems = append(problems, err.Error()) problems = append(problems, err.Error())
} }
@ -216,6 +331,110 @@ func LoadConfig() (*Config, error) {
if cfg.PriceOutputPerM, err = getenvFloat("XAI_PRICE_OUTPUT_PER_M", 2.50); err != nil { if cfg.PriceOutputPerM, err = getenvFloat("XAI_PRICE_OUTPUT_PER_M", 2.50); err != nil {
problems = append(problems, err.Error()) problems = append(problems, err.Error())
} }
// Per-model price table. The default (final-voice) model is priced from the
// XAI_PRICE_* envs; additional models register their own entry as their layer
// lands. priceFor falls back to this default model for an unknown model.
cfg.Prices = map[string]ModelPrice{
cfg.XAIModel: {
InputPerM: cfg.PriceInputPerM,
CachedPerM: cfg.PriceCachedPerM,
OutputPerM: cfg.PriceOutputPerM,
},
}
var budgetSec, retentionDays int
if budgetSec, err = getenvInt("REQUEST_BUDGET_SECONDS", 180); err != nil {
problems = append(problems, err.Error())
}
cfg.RequestBudget = time.Duration(budgetSec) * time.Second
if cfg.GrokPromptCache, err = getenvBool("GROK_PROMPT_CACHE", false); err != nil {
problems = append(problems, err.Error())
}
if cfg.TelemetryEnabled, err = getenvBool("TELEMETRY_ENABLED", false); err != nil {
problems = append(problems, err.Error())
}
if cfg.TelemetryStoreText, err = getenvBool("TELEMETRY_STORE_TEXT", false); err != nil {
problems = append(problems, err.Error())
}
if retentionDays, err = getenvInt("TELEMETRY_RETENTION_DAYS", 30); err != nil {
problems = append(problems, err.Error())
}
cfg.TelemetryRetention = time.Duration(retentionDays) * 24 * time.Hour
// Cascade flags — every one defaults false, so an unset env is today's bot.
for _, f := range []struct {
key string
dest *bool
}{
{"ROUTER_ENABLED", &cfg.RouterEnabled},
{"ROUTER_CLASSIFIER_ENABLED", &cfg.RouterClassifierEnabled},
{"TRIVIAL_OFFLOAD_ENABLED", &cfg.TrivialOffloadEnabled},
{"WEB_ENABLED", &cfg.WebEnabled},
{"REASONING_ENABLED", &cfg.ReasoningEnabled},
} {
if *f.dest, err = getenvBool(f.key, false); err != nil {
problems = append(problems, err.Error())
}
}
if cfg.WebGroundingDailyCap, err = getenvInt("WEB_GROUNDING_DAILY_CAP", 450); err != nil {
problems = append(problems, err.Error())
}
if cfg.CanaryPercent, err = getenvInt("CANARY_PERCENT", 0); err != nil {
problems = append(problems, err.Error())
}
// Gemini pricing → the per-model table (defaults: gemini-2.5-flash-lite $0.10/$0.40
// per 1M; cached priced as input, a conservative over-count). Lets the ceiling and
// request_log price Gemini calls at Gemini rates.
var gIn, gOut float64
if gIn, err = getenvFloat("GEMINI_PRICE_INPUT_PER_M", 0.10); err != nil {
problems = append(problems, err.Error())
}
if gOut, err = getenvFloat("GEMINI_PRICE_OUTPUT_PER_M", 0.40); err != nil {
problems = append(problems, err.Error())
}
cfg.Prices[cfg.GeminiModel] = ModelPrice{InputPerM: gIn, CachedPerM: gIn, OutputPerM: gOut}
// Reasoning model price (defaults to the final-voice grok rates — grok-4.3 ≈ 4.20),
// so the reasoning route reserves/bills at its own price instead of falling back.
var rIn, rOut float64
if rIn, err = getenvFloat("REASONING_PRICE_INPUT_PER_M", cfg.PriceInputPerM); err != nil {
problems = append(problems, err.Error())
}
if rOut, err = getenvFloat("REASONING_PRICE_OUTPUT_PER_M", cfg.PriceOutputPerM); err != nil {
problems = append(problems, err.Error())
}
cfg.Prices[cfg.ReasoningModel] = ModelPrice{InputPerM: rIn, CachedPerM: cfg.PriceCachedPerM, OutputPerM: rOut}
// Fail-fast on broken cascade wiring (§5/F-FUNC-9), at EVERY start (not just
// check-config): a layer that needs Gemini but has no key would silently never
// fire. Better to refuse to start than to quietly run degraded.
if cfg.needsGemini() && cfg.GeminiAPIKey == "" {
problems = append(problems, "GEMINI_API_KEY is required when TRIVIAL_OFFLOAD_ENABLED, ROUTER_CLASSIFIER_ENABLED, or WEB_ENABLED with gemini_grounding is set")
}
if cfg.RouterClassifierEnabled && !cfg.RouterEnabled {
problems = append(problems, "ROUTER_CLASSIFIER_ENABLED requires ROUTER_ENABLED")
}
if cfg.WebEnabled && cfg.WebProvider != webProviderGrokWebSearch && cfg.WebProvider != webProviderGeminiGrounding {
problems = append(problems, fmt.Sprintf("WEB_PROVIDER must be %q or %q, got %q",
webProviderGrokWebSearch, webProviderGeminiGrounding, cfg.WebProvider))
}
if cfg.ReasoningEnabled && cfg.ReasoningModel == "" {
problems = append(problems, "REASONING_MODEL is required when REASONING_ENABLED is set")
}
switch cfg.GrokReasoningEffort {
case "", "none", "low", "medium", "high":
default:
problems = append(problems, fmt.Sprintf(
"GROK_REASONING_EFFORT must be one of none/low/medium/high (or empty), got %q", cfg.GrokReasoningEffort))
}
switch cfg.ReasoningEffort {
case "none", "low", "medium", "high":
default:
problems = append(problems, fmt.Sprintf(
"REASONING_EFFORT must be one of none/low/medium/high, got %q", cfg.ReasoningEffort))
}
if len(problems) > 0 {
return nil, fmt.Errorf("invalid configuration:\n - %s", strings.Join(problems, "\n - "))
@@ -223,6 +442,14 @@ func LoadConfig() (*Config, error) {
return cfg, nil
}
// needsGemini reports whether any enabled layer requires the Gemini backend — the
// cheap trivial route, the Layer-1 classifier, or Gemini-native web grounding. Drives
// both the fail-fast key check and whether the client is built at all.
func (c *Config) needsGemini() bool {
return c.TrivialOffloadEnabled || c.RouterClassifierEnabled ||
(c.WebEnabled && c.WebProvider == webProviderGeminiGrounding)
}
// Summary returns a human-readable, SECRET-REDACTED dump for the startup log.
func (c *Config) Summary() string {
servers := make([]string, 0, len(c.AllowedServers))
@@ -255,6 +482,12 @@ func (c *Config) Summary() string {
" HS_TOKEN = " + redact(c.HSToken),
" XAI_BASE_URL = " + c.XAIBaseURL,
" XAI_MODEL = " + c.XAIModel,
" GROK_REASONING_EFFORT = " + func() string {
if c.GrokReasoningEffort == "" {
return "(unset — not sent; provider default)"
}
return c.GrokReasoningEffort
}(),
" XAI_API_KEY = " + redact(c.XAIAPIKey),
fmt.Sprintf(" XAI_TEMPERATURE = %g", c.XAITemp),
fmt.Sprintf(" MAX_OUTPUT_TOKENS = %d", c.MaxOutTok),
@@ -268,5 +501,14 @@ func (c *Config) Summary() string {
" SYSTEM_PROMPT_PATH = " + c.SystemPromptPath,
" STATE_DIR = " + c.StateDir,
" AI_BOT_DATABASE_URL= " + redact(c.DatabaseURL),
fmt.Sprintf(" REQUEST_BUDGET = %s", c.RequestBudget),
fmt.Sprintf(" GROK_PROMPT_CACHE = %t", c.GrokPromptCache),
fmt.Sprintf(" TELEMETRY_ENABLED = %t (store_text=%t, retention=%s)",
c.TelemetryEnabled, c.TelemetryStoreText, c.TelemetryRetention),
fmt.Sprintf(" CASCADE: router=%t classifier=%t trivial=%t web=%t(%s, cap=%d) reason=%t(%s)",
c.RouterEnabled, c.RouterClassifierEnabled, c.TrivialOffloadEnabled,
c.WebEnabled, c.WebProvider, c.WebGroundingDailyCap, c.ReasoningEnabled, c.ReasoningEffort),
" GEMINI_MODEL = " + c.GeminiModel,
" GEMINI_API_KEY = " + redact(c.GeminiAPIKey),
}, "\n")
}
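The fail-fast wiring checks in LoadConfig can be sketched in isolation as below. This is a minimal illustration, not the commit's code: `cascadeFlags` and `validate` are hypothetical names, and the literal provider strings `gemini_grounding` and `grok_web_search` are assumed to be the values behind `webProviderGeminiGrounding` and `webProviderGrokWebSearch`.

```go
package main

import (
	"fmt"
	"strings"
)

// cascadeFlags is a hypothetical, trimmed-down stand-in for the Config fields
// that the wiring checks read.
type cascadeFlags struct {
	RouterEnabled           bool
	RouterClassifierEnabled bool
	TrivialOffloadEnabled   bool
	WebEnabled              bool
	WebProvider             string
	GeminiAPIKey            string
}

// needsGemini mirrors the predicate in the diff: any enabled layer that talks to Gemini.
func (f cascadeFlags) needsGemini() bool {
	return f.TrivialOffloadEnabled || f.RouterClassifierEnabled ||
		(f.WebEnabled && f.WebProvider == "gemini_grounding") // assumed constant value
}

// validate collects every problem before returning, so one start-up failure
// reports all misconfigurations at once instead of only the first.
func validate(f cascadeFlags) error {
	var problems []string
	if f.needsGemini() && f.GeminiAPIKey == "" {
		problems = append(problems, "GEMINI_API_KEY is required by an enabled Gemini layer")
	}
	if f.RouterClassifierEnabled && !f.RouterEnabled {
		problems = append(problems, "ROUTER_CLASSIFIER_ENABLED requires ROUTER_ENABLED")
	}
	if len(problems) > 0 {
		return fmt.Errorf("invalid configuration:\n - %s", strings.Join(problems, "\n - "))
	}
	return nil
}

func main() {
	// Classifier on, router off, no key: two problems reported together.
	fmt.Println(validate(cascadeFlags{RouterClassifierEnabled: true}))
	// Web via Grok needs no Gemini key, so this passes.
	fmt.Println(validate(cascadeFlags{WebEnabled: true, WebProvider: "grok_web_search"}))
}
```

Accumulating into a slice rather than returning early matches the `problems` idiom used throughout LoadConfig.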


@ -0,0 +1,98 @@
package main
import (
"strings"
"testing"
)
// setBaseEnv sets the minimal valid environment (all cascade flags off) so each test
// can toggle one combination and assert the fail-fast validation (F-FUNC-9).
func setBaseEnv(t *testing.T) {
t.Helper()
t.Setenv("HOMESERVER_URL", "http://hs")
t.Setenv("BOT_MXID", "@ai:vojo.chat")
t.Setenv("AS_TOKEN", "as")
t.Setenv("HS_TOKEN", "hs")
t.Setenv("XAI_API_KEY", "xai")
t.Setenv("AI_BOT_DATABASE_URL", "postgres://x")
t.Setenv("ALLOWED_SERVERS", "vojo.chat")
// Force a clean baseline so the host environment can't leak in.
for _, k := range []string{
"GEMINI_API_KEY", "GEMINI_API_KEY_FILE", "ROUTER_ENABLED", "ROUTER_CLASSIFIER_ENABLED",
"TRIVIAL_OFFLOAD_ENABLED", "WEB_ENABLED", "REASONING_ENABLED", "WEB_PROVIDER", "REASONING_MODEL",
} {
t.Setenv(k, "")
}
}
func TestConfigBaseValid(t *testing.T) {
setBaseEnv(t)
if _, err := LoadConfig(); err != nil {
t.Fatalf("base config should be valid: %v", err)
}
}
func TestConfigAllCascadeFlagsDefaultOff(t *testing.T) {
setBaseEnv(t)
cfg, err := LoadConfig()
if err != nil {
t.Fatalf("%v", err)
}
if cfg.RouterEnabled || cfg.RouterClassifierEnabled || cfg.TrivialOffloadEnabled ||
cfg.WebEnabled || cfg.ReasoningEnabled || cfg.TelemetryEnabled || cfg.GrokPromptCache {
t.Fatal("every cascade/telemetry flag must default off (cascade-off == today)")
}
if cfg.WebProvider != webProviderGrokWebSearch {
t.Fatalf("default WEB_PROVIDER = %q, want grok_web_search", cfg.WebProvider)
}
}
func TestConfigTrivialNeedsGeminiKey(t *testing.T) {
setBaseEnv(t)
t.Setenv("TRIVIAL_OFFLOAD_ENABLED", "true")
if _, err := LoadConfig(); err == nil || !strings.Contains(err.Error(), "GEMINI_API_KEY") {
t.Fatalf("want GEMINI_API_KEY error, got %v", err)
}
t.Setenv("GEMINI_API_KEY", "gk")
if _, err := LoadConfig(); err != nil {
t.Fatalf("with key it should be valid: %v", err)
}
}
func TestConfigClassifierNeedsRouter(t *testing.T) {
setBaseEnv(t)
t.Setenv("GEMINI_API_KEY", "gk")
t.Setenv("ROUTER_CLASSIFIER_ENABLED", "true") // without ROUTER_ENABLED
if _, err := LoadConfig(); err == nil || !strings.Contains(err.Error(), "ROUTER_ENABLED") {
t.Fatalf("want ROUTER_ENABLED error, got %v", err)
}
}
func TestConfigBadWebProvider(t *testing.T) {
setBaseEnv(t)
t.Setenv("WEB_ENABLED", "true")
t.Setenv("WEB_PROVIDER", "bing")
if _, err := LoadConfig(); err == nil || !strings.Contains(err.Error(), "WEB_PROVIDER") {
t.Fatalf("want WEB_PROVIDER error, got %v", err)
}
}
// The default web provider (grok_web_search) uses the existing xAI key, so WEB_ENABLED
// alone must NOT demand a Gemini key.
func TestConfigWebGrokNeedsNoGeminiKey(t *testing.T) {
setBaseEnv(t)
t.Setenv("WEB_ENABLED", "true")
if _, err := LoadConfig(); err != nil {
t.Fatalf("web+grok_web_search should not need a Gemini key: %v", err)
}
}
// gemini_grounding DOES need a Gemini key.
func TestConfigWebGeminiGroundingNeedsKey(t *testing.T) {
setBaseEnv(t)
t.Setenv("WEB_ENABLED", "true")
t.Setenv("WEB_PROVIDER", webProviderGeminiGrounding)
if _, err := LoadConfig(); err == nil || !strings.Contains(err.Error(), "GEMINI_API_KEY") {
t.Fatalf("want GEMINI_API_KEY error, got %v", err)
}
}


@ -9,20 +9,20 @@ type bufferedMsg struct {
isBot bool
}
-// buildContext assembles the xAI message list under the owner's minimisation
-// rule ("trigger + bot replies only", §6/F8):
+// buildContext assembles the provider-neutral message list under the owner's
+// minimisation rule ("trigger + bot replies only", §6/F8):
//
// - GROUP rooms: send ONLY the bot's own prior replies (assistant turns) plus
// the single triggering message (user turn). Other participants' messages and
-// display names never reach xAI — the third-party-consent mitigation.
+// display names never reach the model — the third-party-consent mitigation.
// - 1:1 rooms: there are no third parties, so the peer's recent turns are
// included too for coherence. Still no display names (pseudo "user").
//
// `history` is the recent room window EXCLUDING the trigger; `triggerBody` is the
// message that addressed the bot. Bodies are stripped of reply-fallback quotes so
// quoted third-party text doesn't leak.
-func buildContext(system string, history []bufferedMsg, isDM bool, triggerBody string, maxEvents, maxTokens int) []xaiMessage {
-msgs := []xaiMessage{{Role: "system", Content: system}}
+func buildContext(system string, history []bufferedMsg, isDM bool, triggerBody string, maxEvents, maxTokens int) []Message {
+msgs := []Message{{Role: "system", Content: system}}
// Keep at most the last maxEvents history items.
if len(history) > maxEvents {
@@ -34,16 +34,16 @@ func buildContext(system string, history []bufferedMsg, isDM bool, triggerBody s
continue
}
if h.isBot {
-msgs = append(msgs, xaiMessage{Role: "assistant", Content: body})
+msgs = append(msgs, Message{Role: "assistant", Content: body})
continue
}
if isDM {
-msgs = append(msgs, xaiMessage{Role: "user", Content: body})
+msgs = append(msgs, Message{Role: "user", Content: body})
}
// group + non-bot history → dropped (privacy minimisation)
}
-msgs = append(msgs, xaiMessage{Role: "user", Content: stripReplyFallback(triggerBody)})
+msgs = append(msgs, Message{Role: "user", Content: stripReplyFallback(triggerBody)})
return truncateToTokens(msgs, maxTokens)
}
@@ -57,7 +57,7 @@ func estimateTokens(s string) int {
// truncateToTokens drops the oldest non-system, non-final messages until the
// estimate fits maxTokens. The system prompt (index 0) and the final user
// trigger are always preserved.
-func truncateToTokens(msgs []xaiMessage, maxTokens int) []xaiMessage {
+func truncateToTokens(msgs []Message, maxTokens int) []Message {
total := 0
for _, m := range msgs {
total += estimateTokens(m.Content)
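The hunk above ends before the body of truncateToTokens, so the following is only a sketch of the documented behaviour (drop the oldest middle messages, always keep the system prompt and the final trigger). `estimateTokensSketch` uses an assumed four-bytes-per-token heuristic; the real estimateTokens is not shown in this diff, and `sampleWindow` is a hypothetical fixture.

```go
package main

import "fmt"

// Message mirrors the provider-neutral chat turn from llm.go.
type Message struct {
	Role    string
	Content string
}

// estimateTokensSketch is an ASSUMED heuristic (about four bytes per token, never zero).
func estimateTokensSketch(s string) int { return len(s)/4 + 1 }

// truncateSketch drops the oldest message after the system prompt until the
// estimate fits, but never removes index 0 (system) or the last element (trigger).
func truncateSketch(msgs []Message, maxTokens int) []Message {
	out := append([]Message(nil), msgs...) // copy so the caller's slice is untouched
	total := 0
	for _, m := range out {
		total += estimateTokensSketch(m.Content)
	}
	for total > maxTokens && len(out) > 2 {
		total -= estimateTokensSketch(out[1].Content)
		out = append(out[:1], out[2:]...)
	}
	return out
}

// sampleWindow is a hypothetical four-message window: system, two old turns, trigger.
func sampleWindow() []Message {
	return []Message{
		{Role: "system", Content: "sys"},
		{Role: "assistant", Content: "aaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa"},
		{Role: "assistant", Content: "bbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbbb"},
		{Role: "user", Content: "hi?"},
	}
}

func main() {
	for _, m := range truncateSketch(sampleWindow(), 14) {
		fmt.Printf("%s: %d bytes\n", m.Role, len(m.Content))
	}
}
```

When the budget is smaller than the system prompt plus the trigger, the sketch still returns those two messages, which matches the "always preserved" wording in the comment.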

apps/ai-bot/httpllm.go Normal file

@ -0,0 +1,200 @@
package main
import (
"bytes"
"context"
"encoding/json"
"fmt"
"io"
"log/slog"
"math/rand"
"net/http"
"time"
)
// httpllm.go is the shared OpenAI-compatible Chat Completions transport: one
// HTTP+retry implementation reused by every provider adapter. Grok and Gemini both
// expose this wire format, so the retry/backoff classification (429/5xx/network =
// retryable, other 4xx = terminal) lives once here, parameterised by base/key/
// headers, instead of being copied per provider.
// openAIClient performs OpenAI-compatible /chat/completions calls with retry.
type openAIClient struct {
name string // provider label for logs/errors ("xai", "gemini")
base string
key string
http *http.Client
maxTry int
headers map[string]string // extra static headers (provider-specific), may be nil
log *slog.Logger
}
func newOpenAIClient(name, base, key string, headers map[string]string, logger *slog.Logger) *openAIClient {
return &openAIClient{
name: name,
base: base,
key: key,
http: &http.Client{},
maxTry: 3,
headers: headers,
log: logger,
}
}
// --- OpenAI-compatible wire types -------------------------------------------------
type openAIMessage struct {
Role string `json:"role"`
Content string `json:"content"`
}
// openAITool is the wire shape of a model tool (e.g. web search). Only serialized
// when the request carries tools, so a plain completion's body is unchanged.
type openAITool struct {
Type string `json:"type"`
}
type openAIRequest struct {
Model string `json:"model"`
Messages []openAIMessage `json:"messages"`
MaxTokens int `json:"max_tokens"`
Temperature float64 `json:"temperature"`
Stream bool `json:"stream"`
// Optional; omitempty keeps the grok_direct body byte-identical to before.
Tools []openAITool `json:"tools,omitempty"`
ReasoningEffort string `json:"reasoning_effort,omitempty"`
// SearchParameters drives xAI Live Search on chat/completions (the web route's
// grok_web_search provider). nil for every non-web call, so it serializes away.
SearchParameters any `json:"search_parameters,omitempty"`
}
type openAIUsage struct {
PromptTokens int `json:"prompt_tokens"`
CompletionTokens int `json:"completion_tokens"`
PromptTokensDetails struct {
CachedTokens int `json:"cached_tokens"`
} `json:"prompt_tokens_details"`
}
type openAIResponse struct {
ID string `json:"id"`
Choices []struct {
Message struct {
Content string `json:"content"`
} `json:"message"`
FinishReason string `json:"finish_reason"`
} `json:"choices"`
Usage openAIUsage `json:"usage"`
// Citations is the source list xAI Live Search returns by default (absent on a
// non-web call → nil).
Citations []string `json:"citations"`
}
func (r *openAIResponse) Text() string {
if len(r.Choices) == 0 {
return ""
}
return r.Choices[0].Message.Content
}
// complete calls Chat Completions with retry on transient failures (429 / 5xx /
// network timeout, exponential backoff + jitter). Non-retryable 4xx fail
// immediately. On exhaustion the caller refunds the reserved request and notifies
// the user, so a transient failure is never silently swallowed (F6). reqHeaders are
// per-request headers (e.g. x-grok-conv-id) merged on top of the static ones; nil is
// fine.
func (c *openAIClient) complete(ctx context.Context, reqBody openAIRequest, reqHeaders map[string]string) (*openAIResponse, error) {
payload, err := json.Marshal(reqBody)
if err != nil {
return nil, err
}
var lastErr error
for attempt := 0; attempt < c.maxTry; attempt++ {
if attempt > 0 {
// 0.5s, 1s, 2s … capped at 8s, plus up to 250ms jitter.
backoff := time.Duration(500<<uint(attempt-1)) * time.Millisecond
if backoff > 8*time.Second {
backoff = 8 * time.Second
}
backoff += time.Duration(rand.Intn(250)) * time.Millisecond
select {
case <-ctx.Done():
return nil, ctx.Err()
case <-time.After(backoff):
}
}
resp, retryable, err := c.attempt(ctx, payload, reqHeaders)
if err == nil {
return resp, nil
}
lastErr = err
if ctx.Err() != nil {
return nil, ctx.Err()
}
if !retryable {
return nil, err
}
if c.log != nil {
c.log.Warn(c.name+" attempt failed, will retry", "attempt", attempt+1, "max", c.maxTry, "err", err)
}
}
return nil, fmt.Errorf("%s: exhausted %d attempts: %w", c.name, c.maxTry, lastErr)
}
// attempt performs one HTTP call. It returns retryable=true for 429/5xx and
// network errors, false for other non-2xx (terminal 4xx). The per-attempt deadline
// bounds a single hung connection; the overall per-request deadline (set by the
// caller via ctx) bounds the whole retry loop so a cascade can't accrete minutes.
func (c *openAIClient) attempt(ctx context.Context, payload []byte, reqHeaders map[string]string) (*openAIResponse, bool, error) {
attemptCtx, cancel := context.WithTimeout(ctx, 60*time.Second)
defer cancel()
req, err := http.NewRequestWithContext(attemptCtx, http.MethodPost, c.base+"/chat/completions", bytes.NewReader(payload))
if err != nil {
return nil, false, err
}
req.Header.Set("Content-Type", "application/json")
req.Header.Set("Authorization", "Bearer "+c.key)
for k, v := range c.headers {
req.Header.Set(k, v)
}
for k, v := range reqHeaders {
req.Header.Set(k, v)
}
resp, err := c.http.Do(req)
if err != nil {
// Network error / timeout — retryable (unless the parent ctx is done).
return nil, ctx.Err() == nil, err
}
defer resp.Body.Close()
data, _ := io.ReadAll(resp.Body)
if resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode >= 500 {
return nil, true, fmt.Errorf("%s http %d: %s", c.name, resp.StatusCode, snippet(data))
}
if resp.StatusCode < 200 || resp.StatusCode >= 300 {
return nil, false, fmt.Errorf("%s http %d: %s", c.name, resp.StatusCode, snippet(data))
}
var out openAIResponse
if err := json.Unmarshal(data, &out); err != nil {
return nil, false, fmt.Errorf("%s decode: %w", c.name, err)
}
// A 2xx is a billed call even when the model returns empty content (content
// filter, finish_reason=length with no text, or no choices). Return it as a
// success so the caller books the real cost via the ledger instead of refunding
// the slot and losing the spend — which would let empty replies bypass BOTH the
// per-user cap and the global ceiling. The caller just won't send an empty body.
return &out, false, nil
}
func snippet(b []byte) string {
const max = 300
if len(b) > max {
return string(b[:max]) + "…"
}
return string(b)
}
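The retry delay in complete() can be worked through on its own. The sketch below reproduces only the deterministic part of the schedule (the up-to-250ms jitter is left out); `backoffBase` and `backoffMillis` are hypothetical helper names, not functions from the commit.

```go
package main

import (
	"fmt"
	"time"
)

// backoffBase returns the pre-jitter delay slept before retry number `attempt`
// (1 = first retry), using the same shift-and-cap rule as complete().
func backoffBase(attempt int) time.Duration {
	if attempt < 1 {
		return 0 // the first call is made immediately
	}
	d := time.Duration(500<<uint(attempt-1)) * time.Millisecond
	if d > 8*time.Second {
		d = 8 * time.Second
	}
	return d
}

// backoffMillis exposes the same value as plain milliseconds.
func backoffMillis(attempt int) int64 { return backoffBase(attempt).Milliseconds() }

func main() {
	for attempt := 1; attempt <= 6; attempt++ {
		fmt.Printf("before retry %d: %v\n", attempt, backoffBase(attempt))
	}
}
```

With maxTry fixed at 3, only the first two delays (0.5s and 1s) are ever slept. Three 60-second attempts plus those sleeps can come to roughly 182 seconds, slightly above the 180-second REQUEST_BUDGET_SECONDS default, so in the worst case it is the caller's context deadline that ends the loop.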

apps/ai-bot/llm.go Normal file

@ -0,0 +1,65 @@
package main
import "context"
// llm.go is the provider-neutral seam between the bot's business logic and the
// concrete model backends. Nothing here names a vendor: the bot composes its
// context, prices usage, and books spend against these types, and a thin adapter
// (provider_xai.go, provider_gemini.go) maps them to/from each backend's wire
// format. This is what lets a second model (Gemini) slot in behind a flag without
// the business logic learning a new shape.
// Message is one provider-neutral chat turn.
type Message struct {
Role string // "system" | "user" | "assistant"
Content string
}
// Usage is the provider-neutral token accounting returned with a completion. It
// drives billing (computeUSD) — the counts are the API's own, authoritative even
// if our price constants drift.
type Usage struct {
PromptTokens int
CachedTokens int // subset of PromptTokens served from the provider's prompt cache
CompletionTokens int
}
// Tool is a provider-neutral tool the model may invoke (e.g. web search). Empty
// today; the web-freshness layer (Phase 3) populates it. Carried here so the
// request type is stable across phases.
type Tool struct {
// Type names the tool, e.g. "web_search". Adapters translate it to each
// backend's tool wire shape.
Type string
}
// LLMRequest is a provider-neutral completion request. New optional fields (Tools,
// ReasoningEffort) serialize away when empty, so a plain grok_direct call produces
// exactly the same wire body it did before this seam existed.
type LLMRequest struct {
Model string
Messages []Message
MaxTokens int
Temperature float64
Tools []Tool // optional; populated by the web layer
ReasoningEffort string // optional; "" = default, e.g. "low"|"high" for the reasoning route
// ConvID is an optional prompt-cache routing hint. Adapters that support it (xAI's
// x-grok-conv-id) pin a conversation to one backend to raise cache hit rate; "" =
// don't send it. It is a header, not part of the request body, so it never changes
// the wire body and an unset value is a no-op.
ConvID string
}
// LLMResponse is a provider-neutral completion result.
type LLMResponse struct {
Text string
Usage Usage
ProviderRequestID string // the backend's response id, logged for support/debug
}
// LLMClient is any chat-completion backend (Grok, Gemini, …). Implementations are
// thin adapters over a wire protocol; the bot depends only on this interface, so
// Bot.llm can be swapped or routed without touching business logic.
type LLMClient interface {
Complete(ctx context.Context, req LLMRequest) (*LLMResponse, error)
}
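One practical consequence of the LLMClient seam is that a test double can stand in for a real backend. The sketch below re-declares trimmed copies of the neutral types so it compiles on its own; `stubClient`, `answer`, `demo`, and the model name `stub-model` are hypothetical and not part of the commit.

```go
package main

import (
	"context"
	"fmt"
)

// Trimmed copies of the neutral types from llm.go, so the sketch is self-contained.
type Message struct{ Role, Content string }

type LLMRequest struct {
	Model    string
	Messages []Message
}

type LLMResponse struct{ Text string }

type LLMClient interface {
	Complete(ctx context.Context, req LLMRequest) (*LLMResponse, error)
}

// stubClient is a hypothetical in-memory backend that returns a canned reply.
type stubClient struct{ reply string }

func (s stubClient) Complete(_ context.Context, _ LLMRequest) (*LLMResponse, error) {
	return &LLMResponse{Text: s.reply}, nil
}

// answer depends only on the interface, so any adapter (or stub) can be passed in.
func answer(ctx context.Context, llm LLMClient, prompt string) (string, error) {
	resp, err := llm.Complete(ctx, LLMRequest{
		Model:    "stub-model",
		Messages: []Message{{Role: "user", Content: prompt}},
	})
	if err != nil {
		return "", err
	}
	return resp.Text, nil
}

// demo wires the stub into answer and returns the reply text.
func demo() string {
	text, err := answer(context.Background(), stubClient{reply: "pong"}, "ping")
	if err != nil {
		return "error: " + err.Error()
	}
	return text
}

func main() { fmt.Println(demo()) }
```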


@@ -11,7 +11,6 @@ import (
"fmt"
"os"
"os/signal"
-"path/filepath"
"syscall"
)
@@ -91,8 +90,3 @@ func main() {
}
logger.Info("shut down cleanly")
}
-// statePath joins a filename under the configured state directory.
-func (c *Config) statePath(name string) string {
-return filepath.Join(c.StateDir, name)
-}

apps/ai-bot/pricing.go Normal file

@ -0,0 +1,42 @@
package main
// pricing.go centralises model pricing as a per-model table (the LiteLLM pattern)
// instead of three hardcoded Grok fields. The spend ledger prices each call by the
// model it actually used, so when a second model (Gemini) starts answering some
// routes, its cost books correctly against the same global ceiling.
// ModelPrice is the per-1M-token USD price for one model, applied to the API's
// returned usage so the wallet ceiling tracks real cost even as prices change.
type ModelPrice struct {
InputPerM float64 // non-cached prompt tokens
CachedPerM float64 // prompt tokens served from the provider cache (cheaper)
OutputPerM float64 // completion tokens
}
// CostBreakdown is the per-component USD cost of answering one request. A plain
// grok_direct call has only Token; a cascade adds Router (the cheap classifier),
// Grounding (Gemini Google-search) and/or WebTool (Grok web search) on top. Settle
// books each column separately so the ledger and request_log can attribute spend,
// and so a half-finished cascade can book only what it actually spent (§8.1).
type CostBreakdown struct {
Token float64
Grounding float64
WebTool float64
Router float64
}
// Total is the grand total across all components (the number the wallet ceiling and
// request_log.total_usd care about). Computed, never stored, so it can't drift.
func (c CostBreakdown) Total() float64 {
return c.Token + c.Grounding + c.WebTool + c.Router
}
// priceFor returns the configured price for a model. An unknown model falls back to
// the default (final-voice) model's price rather than $0 — a $0 price would silently
// blind the global ceiling to that call, the one failure mode we never want.
func (c *Config) priceFor(model string) ModelPrice {
if p, ok := c.Prices[model]; ok {
return p
}
return c.Prices[c.XAIModel]
}
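The price table is applied to the API's returned usage by computeUSD, which this diff references but does not show. The following is therefore only a plausible sketch of that arithmetic, under the stated assumption (from llm.go) that cached tokens are a subset of prompt tokens. `costSketch` is a hypothetical name, and the cached rate of 0.25 is an assumed figure, since the default for the cached price is not visible here.

```go
package main

import "fmt"

// Trimmed copies of the types from pricing.go and llm.go.
type ModelPrice struct{ InputPerM, CachedPerM, OutputPerM float64 }
type Usage struct{ PromptTokens, CachedTokens, CompletionTokens int }

// costSketch prices one call in USD: non-cached prompt tokens at the input rate,
// cached prompt tokens at the cached rate, completion tokens at the output rate.
func costSketch(p ModelPrice, u Usage) float64 {
	cached := u.CachedTokens
	if cached < 0 {
		cached = 0
	}
	if cached > u.PromptTokens {
		cached = u.PromptTokens // cached is a subset of prompt; clamp odd provider data
	}
	fresh := u.PromptTokens - cached
	return (float64(fresh)*p.InputPerM +
		float64(cached)*p.CachedPerM +
		float64(u.CompletionTokens)*p.OutputPerM) / 1e6
}

func main() {
	price := ModelPrice{InputPerM: 1.25, CachedPerM: 0.25, OutputPerM: 2.50} // cached rate is assumed
	usage := Usage{PromptTokens: 1000, CachedTokens: 400, CompletionTokens: 200}
	// (600*1.25 + 400*0.25 + 200*2.50) / 1e6
	fmt.Printf("%.5f USD\n", costSketch(price, usage))
}
```

For the sample usage the sum is (750 + 100 + 500) / 1,000,000 = 0.00135 USD, which would land in the Token column of CostBreakdown.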


@ -0,0 +1,189 @@
package main
import (
"bytes"
"context"
"encoding/json"
"fmt"
"io"
"log/slog"
"net/http"
"net/url"
"strings"
"time"
)
// provider_gemini.go is the Gemini backend. Two faces:
//
// - geminiClient: a thin LLMClient over the OpenAI-compatible endpoint, used for the
// cheap trivial route and the Layer-1 router classifier. Same wire format as Grok,
// so it reuses the shared transport (httpllm.go).
// - groundedSearch: a SEPARATE call against the NATIVE v1beta generateContent endpoint
// with the google_search tool. Grounding does NOT work on the OpenAI-compat layer
// (it is silently ignored there, and only on Gemini 3+) — verified against Google's
// docs (F-EXT-3) — so the web layer that wants Gemini grounding must use this native
// path and VERIFY citations came back, else degrade.
type geminiClient struct {
http *openAIClient
nativeBase string // …/v1beta — derived from the OpenAI-compat base by dropping /openai
key string
model string
httpc *http.Client
log *slog.Logger
}
// NewGeminiClient builds the Gemini backend. base is the OpenAI-compatible endpoint
// (…/v1beta/openai); the native grounding endpoint is derived from it. Returns the
// concrete type (not just LLMClient) because the web layer needs groundedSearch too.
func NewGeminiClient(base, key, model string, logger *slog.Logger) *geminiClient {
return &geminiClient{
http: newOpenAIClient("gemini", base, key, nil, logger),
nativeBase: strings.TrimSuffix(base, "/openai"),
key: key,
model: model,
httpc: &http.Client{},
log: logger,
}
}
// Complete answers via the OpenAI-compatible endpoint (trivial route + classifier).
func (c *geminiClient) Complete(ctx context.Context, req LLMRequest) (*LLMResponse, error) {
msgs := make([]openAIMessage, len(req.Messages))
for i, m := range req.Messages {
msgs[i] = openAIMessage{Role: m.Role, Content: m.Content}
}
resp, err := c.http.complete(ctx, openAIRequest{
Model: req.Model,
Messages: msgs,
MaxTokens: req.MaxTokens,
Temperature: req.Temperature,
Stream: false,
}, nil)
if err != nil {
return nil, err
}
return &LLMResponse{
Text: resp.Text(),
Usage: Usage{
PromptTokens: resp.Usage.PromptTokens,
CachedTokens: resp.Usage.PromptTokensDetails.CachedTokens,
CompletionTokens: resp.Usage.CompletionTokens,
},
ProviderRequestID: resp.ID,
}, nil
}
// --- native v1beta grounded search (google_search tool) ---------------------------
type geminiGroundResult struct {
Digest string
Citations []string
Usage Usage
}
// native generateContent wire types (only the fields we read/write).
type geminiNativeRequest struct {
Contents []geminiContent `json:"contents"`
Tools []geminiTool `json:"tools"`
}
type geminiContent struct {
Role string `json:"role,omitempty"`
Parts []geminiPart `json:"parts"`
}
type geminiPart struct {
Text string `json:"text"`
}
type geminiTool struct {
// google_search is the current grounding tool (Gemini 3 / current models). The
// empty object enables it.
GoogleSearch struct{} `json:"google_search"`
}
type geminiNativeResponse struct {
Candidates []struct {
Content struct {
Parts []geminiPart `json:"parts"`
} `json:"content"`
GroundingMetadata struct {
GroundingChunks []struct {
Web struct {
URI string `json:"uri"`
Title string `json:"title"`
} `json:"web"`
} `json:"groundingChunks"`
} `json:"groundingMetadata"`
} `json:"candidates"`
UsageMetadata struct {
PromptTokenCount int `json:"promptTokenCount"`
CandidatesTokenCount int `json:"candidatesTokenCount"`
CachedContentTokenCount int `json:"cachedContentTokenCount"`
} `json:"usageMetadata"`
}
// groundedSearch runs one grounded generateContent against the native endpoint and
// returns the model's grounded answer plus the source URLs. It REQUIRES citations:
// if groundingMetadata has no chunks the request was not actually grounded (the
// silent-ignore failure mode, F-EXT-3), so it errors and the caller degrades rather
// than passing off ungrounded — possibly stale — text as fresh.
func (c *geminiClient) groundedSearch(ctx context.Context, query string) (geminiGroundResult, error) {
body, err := json.Marshal(geminiNativeRequest{
Contents: []geminiContent{{Role: "user", Parts: []geminiPart{{Text: query}}}},
Tools: []geminiTool{{}},
})
if err != nil {
return geminiGroundResult{}, err
}
// Send the API key in the x-goog-api-key header instead of the query string: a
// failed Do returns a *url.Error that embeds the request URL, so a key in the URL
// could leak into logs.
endpoint := fmt.Sprintf("%s/models/%s:generateContent", c.nativeBase, url.PathEscape(c.model))
reqCtx, cancel := context.WithTimeout(ctx, 15*time.Second) // web/grounding budget (§8.2.2)
defer cancel()
req, err := http.NewRequestWithContext(reqCtx, http.MethodPost, endpoint, bytes.NewReader(body))
if err != nil {
return geminiGroundResult{}, err
}
req.Header.Set("Content-Type", "application/json")
req.Header.Set("x-goog-api-key", c.key)
resp, err := c.httpc.Do(req)
if err != nil {
return geminiGroundResult{}, err
}
defer resp.Body.Close()
data, _ := io.ReadAll(resp.Body)
if resp.StatusCode < 200 || resp.StatusCode >= 300 {
return geminiGroundResult{}, fmt.Errorf("gemini grounding http %d: %s", resp.StatusCode, snippet(data))
}
var out geminiNativeResponse
if err := json.Unmarshal(data, &out); err != nil {
return geminiGroundResult{}, fmt.Errorf("gemini grounding decode: %w", err)
}
if len(out.Candidates) == 0 {
return geminiGroundResult{}, fmt.Errorf("gemini grounding: no candidates")
}
var sb strings.Builder
for _, p := range out.Candidates[0].Content.Parts {
sb.WriteString(p.Text)
}
var citations []string
for _, ch := range out.Candidates[0].GroundingMetadata.GroundingChunks {
if ch.Web.URI != "" {
citations = append(citations, ch.Web.URI)
}
}
// The verify-gate: no citations ⇒ not actually grounded ⇒ degrade.
if len(citations) == 0 {
return geminiGroundResult{}, fmt.Errorf("gemini grounding: no citations (ungrounded — degrade)")
}
return geminiGroundResult{
Digest: strings.TrimSpace(sb.String()),
Citations: citations,
Usage: Usage{
PromptTokens: out.UsageMetadata.PromptTokenCount,
CachedTokens: out.UsageMetadata.CachedContentTokenCount,
CompletionTokens: out.UsageMetadata.CandidatesTokenCount,
},
}, nil
}
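Two steps of the grounding path are easy to exercise offline: deriving the native base from the OpenAI-compatible base, and the verify-gate that rejects a response without citations. The sketch below uses a placeholder host and a hand-written sample payload; `nativeBase`, `citationsFrom`, and `sampleGrounded` are hypothetical names, and only the JSON field names are taken from the diff.

```go
package main

import (
	"encoding/json"
	"errors"
	"fmt"
	"strings"
)

// nativeBase derives the v1beta base by dropping a trailing "/openai", as NewGeminiClient does.
func nativeBase(openAIBase string) string {
	return strings.TrimSuffix(openAIBase, "/openai")
}

// groundedResponse holds only the fields the verify-gate reads.
type groundedResponse struct {
	Candidates []struct {
		GroundingMetadata struct {
			GroundingChunks []struct {
				Web struct {
					URI string `json:"uri"`
				} `json:"web"`
			} `json:"groundingChunks"`
		} `json:"groundingMetadata"`
	} `json:"candidates"`
}

// citationsFrom returns the source URLs, or an error when the reply carries none,
// which the caller should treat as "not grounded".
func citationsFrom(data []byte) ([]string, error) {
	var out groundedResponse
	if err := json.Unmarshal(data, &out); err != nil {
		return nil, err
	}
	if len(out.Candidates) == 0 {
		return nil, errors.New("no candidates")
	}
	var urls []string
	for _, ch := range out.Candidates[0].GroundingMetadata.GroundingChunks {
		if ch.Web.URI != "" {
			urls = append(urls, ch.Web.URI)
		}
	}
	if len(urls) == 0 {
		return nil, errors.New("no citations: treat as ungrounded")
	}
	return urls, nil
}

// sampleGrounded is a hand-written payload for illustration, not a captured response.
const sampleGrounded = `{"candidates":[{"groundingMetadata":{"groundingChunks":[{"web":{"uri":"https://example.org/a","title":"A"}}]}}]}`

func main() {
	fmt.Println(nativeBase("https://example.invalid/v1beta/openai"))
	urls, err := citationsFrom([]byte(sampleGrounded))
	fmt.Println(urls, err)
	_, err = citationsFrom([]byte(`{"candidates":[{}]}`))
	fmt.Println(err)
}
```

Note that strings.TrimSuffix only matches an exact suffix, so a configured base ending in "/openai/" (with a trailing slash) would pass through unchanged; normalising the base before trimming may be worth considering.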


@ -0,0 +1,62 @@
package main
import (
"context"
"log/slog"
)
// provider_xai.go is the thin adapter for xAI's Grok backend. xAI speaks the
// OpenAI Chat Completions wire format, so this is a shell over the shared
// openAIClient transport (httpllm.go): it only maps the neutral LLMRequest/
// LLMResponse to/from the wire types. Any xAI-specific request shaping would live
// here, but Grok needs none today.
type xaiClient struct {
http *openAIClient
}
// NewXAIClient builds the Grok backend. Returns the neutral LLMClient so the bot
// holds no vendor type.
func NewXAIClient(base, key string, logger *slog.Logger) LLMClient {
return &xaiClient{http: newOpenAIClient("xai", base, key, nil, logger)}
}
func (c *xaiClient) Complete(ctx context.Context, req LLMRequest) (*LLMResponse, error) {
msgs := make([]openAIMessage, len(req.Messages))
for i, m := range req.Messages {
msgs[i] = openAIMessage{Role: m.Role, Content: m.Content}
}
var tools []openAITool
for _, t := range req.Tools {
tools = append(tools, openAITool{Type: t.Type})
}
// x-grok-conv-id pins this conversation to one backend to raise the prompt-cache
// hit rate (caching itself is automatic on xAI). Only sent when set, so the
// default path adds no header.
var headers map[string]string
if req.ConvID != "" {
headers = map[string]string{"x-grok-conv-id": req.ConvID}
}
resp, err := c.http.complete(ctx, openAIRequest{
Model: req.Model,
Messages: msgs,
MaxTokens: req.MaxTokens,
Temperature: req.Temperature,
Stream: false,
Tools: tools,
ReasoningEffort: req.ReasoningEffort,
}, headers)
if err != nil {
return nil, err
}
return &LLMResponse{
Text: resp.Text(),
Usage: Usage{
PromptTokens: resp.Usage.PromptTokens,
CachedTokens: resp.Usage.PromptTokensDetails.CachedTokens,
CompletionTokens: resp.Usage.CompletionTokens,
},
ProviderRequestID: resp.ID,
}, nil
}

apps/ai-bot/router.go Normal file

@ -0,0 +1,180 @@
package main
import (
"context"
"encoding/json"
"regexp"
"strings"
)
// router.go classifies a message into a route. It runs INSIDE respond() — after the
// mention/media/foreign/single-flight gates (F-FUNC-7) — so a paid Layer-1 classifier
// is never spent on a message today's bot drops for free.
//
// Two layers, both conservative (doubt → grok_direct, the safe floor that keeps
// substantive questions on Grok, §8.6):
// - Layer-0: free regex heuristics (RU+EN). Always runs when ROUTER_ENABLED.
// - Layer-1: a cheap Gemini JSON classifier, consulted ONLY on Layer-0 grok_direct
// when ROUTER_CLASSIFIER_ENABLED. Any failure falls back to the Layer-0 verdict.
// RouterDecision is the route plus the signals behind it (logged for threshold
// calibration). Only Route/Source/Confidence/NeedsWeb drive behaviour today; the rest
// are recorded for the offline router-replay eval (§9).
type RouterDecision struct {
Route string
Source string // heuristic | classifier | default | forced | degraded
Confidence float64
NeedsWeb bool
Freshness string
ReasoningLevel string
Domain string
Difficulty string
}
// Heuristic patterns. Kept deliberately tight: a false "trivial" leaks a real question
// to the cheap model, so trivial fires only on short, unmistakable greetings/acks or
// bare arithmetic. Freshness words route to web (a false web-route only costs a fetch
// and degrades cleanly — never a wrong answer).
var (
greetingRe = regexp.MustCompile(`^(привет(ик)?|здравствуй(те)?|хай|прив|ку|добрый\s+(день|вечер|утро)|спасибо|спс|благодарю|пока|ок(ей)?|угу|ага|hello|hi|hey|yo|thanks|thank\s+you|thx|ty|bye|goodbye|ok|okay|cool|nice)[\s!.,)]*$`)
arithmeticRe = regexp.MustCompile(`^[\s(]*\d+(\s*[-+*/×÷]\s*\d+)+[\s)=?]*$`)
freshnessRe = regexp.MustCompile(`(новост|сегодня|сейчас|последн|курс\s|погод|котировк|расписани|прогноз|breaking|today|right now|latest|current(ly)?|news|weather|stock price|exchange rate|score)`)
)
// routeLayer0 is the free heuristic. Confidence is a rough self-estimate used only for
// logging/threshold tuning, not control flow.
func routeLayer0(body string) RouterDecision {
s := strings.ToLower(strings.TrimSpace(body))
if s == "" {
return RouterDecision{Route: routeGrokDirect, Source: "heuristic", Confidence: 0.5}
}
if freshnessRe.MatchString(s) {
return RouterDecision{Route: routeWebThenGrok, Source: "heuristic", Confidence: 0.7, NeedsWeb: true, Freshness: "recent"}
}
if isTrivial(s) {
return RouterDecision{Route: routeTrivial, Source: "heuristic", Confidence: 0.85, Difficulty: "trivial"}
}
return RouterDecision{Route: routeGrokDirect, Source: "heuristic", Confidence: 0.6}
}
// isTrivial: a short greeting/ack or a bare arithmetic expression, with no sign of a
// real question. Length-bounded so "thanks, now explain quantum tunnelling" is NOT
// trivial.
func isTrivial(s string) bool {
if arithmeticRe.MatchString(s) {
return true
}
if len(strings.Fields(s)) <= 4 && greetingRe.MatchString(s) {
return true
}
return false
}
// classify produces the final RouterDecision for a request. The manual reasoning
// trigger is honoured independently of the heuristic router (it's a deliberate user
// signal). Layer-1's cost, when it runs, is accumulated into cost.Router.
func (b *Bot) classify(ctx context.Context, body string, cost *CostBreakdown) RouterDecision {
if b.cfg.ReasoningEnabled && containsTrigger(body, b.cfg.ReasoningTrigger) {
return RouterDecision{Route: routeReason, Source: "forced", Confidence: 1, ReasoningLevel: "high"}
}
if !b.cfg.RouterEnabled {
return RouterDecision{Route: routeGrokDirect, Source: "default"}
}
d := routeLayer0(body)
// Layer-1 only refines the uncertain grok_direct verdict, and only if enabled and
// the Gemini client exists. Anything else stands on the heuristic.
if d.Route != routeGrokDirect || !b.cfg.RouterClassifierEnabled || b.gemini == nil {
return d
}
refined, err := b.routeLayer1(ctx, body, cost)
if err != nil {
b.log.Warn("layer-1 classifier failed; using heuristic", "err", err)
return d // degrade to the heuristic verdict
}
return refined
}
// classifierConfidenceFloor is the bar a Layer-1 escalation OFF the safe floor
// (trivial/web/reason) must clear. Below it, the verdict is treated as doubt and the
// request stays on grok_direct — the owner's "substantive stays on Grok" rule (§8.6).
// A low-confidence "trivial" is exactly the false-trivial voice leak we must not take.
const classifierConfidenceFloor = 0.8
// classifierPrompt asks Gemini for a strict JSON verdict. Kept terse to bound tokens.
const classifierPrompt = `You are a router. Classify the user message into exactly one route and reply with ONLY a JSON object, no prose.
Routes: "trivial" (greeting/ack/tiny arithmetic), "web" (needs fresh/current facts: news, prices, weather, "today"), "normal" (everything else).
Schema: {"route":"trivial|web|normal","confidence":0.0-1.0,"needs_web":true|false}
Message: `
// routeLayer1 runs the Gemini classifier and parses its JSON. A non-JSON or unknown
// answer is an error so classify() degrades to the heuristic — the cheap model never
// gets to silently mis-route by returning garbage.
func (b *Bot) routeLayer1(ctx context.Context, body string, cost *CostBreakdown) (RouterDecision, error) {
resp, err := b.gemini.Complete(ctx, LLMRequest{
Model: b.cfg.GeminiModel,
Messages: []Message{{Role: "user", Content: classifierPrompt + body}},
MaxTokens: 60,
Temperature: 0,
})
if err != nil {
return RouterDecision{}, err
}
cost.Router += computeUSD(b.cfg.GeminiModel, resp.Usage, b.cfg)
var parsed struct {
Route string `json:"route"`
Confidence float64 `json:"confidence"`
NeedsWeb bool `json:"needs_web"`
}
if err := json.Unmarshal([]byte(extractJSON(resp.Text)), &parsed); err != nil {
return RouterDecision{}, err
}
route := normalizeRoute(parsed.Route)
// Safe floor: a low-confidence escalation off grok_direct is doubt — keep it on
// Grok rather than leak a possibly-substantive question to the cheap model.
if route != routeGrokDirect && parsed.Confidence < classifierConfidenceFloor {
return RouterDecision{Route: routeGrokDirect, Source: "classifier", Confidence: parsed.Confidence}, nil
}
return RouterDecision{
Route: route,
Source: "classifier",
Confidence: parsed.Confidence,
NeedsWeb: parsed.NeedsWeb || route == routeWebThenGrok,
}, nil
}
// normalizeRoute maps a classifier label to a route constant, defaulting unknown
// labels to grok_direct — the safe floor, so a confused classifier never escalates.
func normalizeRoute(label string) string {
switch strings.ToLower(strings.TrimSpace(label)) {
case "trivial", "trivial_direct":
return routeTrivial
case "web", "web_then_grok":
return routeWebThenGrok
case "reason", "reason_then_grok":
return routeReason
default:
return routeGrokDirect
}
}
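// extractJSON returns the span from the first '{' to the last '}' in a model reply,
// tolerating prose or code fences around it. It does not isolate the first object: a
// reply holding several objects yields one span covering all of them, which then fails
// to parse. Returns "" if there is no such span (→ a parse error → degrade).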
func extractJSON(s string) string {
i := strings.IndexByte(s, '{')
j := strings.LastIndexByte(s, '}')
if i < 0 || j < i {
return ""
}
return s[i : j+1]
}
// containsTrigger reports whether body contains the manual trigger phrase
// (case-insensitive, whitespace-trimmed). Empty trigger never matches.
func containsTrigger(body, trigger string) bool {
trigger = strings.TrimSpace(strings.ToLower(trigger))
if trigger == "" {
return false
}
return strings.Contains(strings.ToLower(body), trigger)
}
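
The parsing half of routeLayer1 combines three rules: no JSON means an error, an unknown label means the safe floor, and a low-confidence escalation also means the safe floor. The sketch below replays those rules without a network call. The function name `decide`, the `verdict` type, and the route strings `"grok_direct"`, `"trivial"`, and `"web_then_grok"` are assumptions for illustration; the commit's actual route constants are defined elsewhere and their values are not shown in this diff.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strings"
)

// verdict mirrors the JSON schema requested by classifierPrompt.
type verdict struct {
	Route      string  `json:"route"`
	Confidence float64 `json:"confidence"`
	NeedsWeb   bool    `json:"needs_web"`
}

// confidenceFloor uses the same value as classifierConfidenceFloor in router.go.
const confidenceFloor = 0.8

// decide is a hypothetical stand-in for the post-processing in routeLayer1.
// The route strings it returns are illustrative, not the commit's constants.
func decide(reply string) (string, error) {
	// Take the span from the first '{' to the last '}', as extractJSON does.
	i := strings.IndexByte(reply, '{')
	j := strings.LastIndexByte(reply, '}')
	if i < 0 || j < i {
		return "", fmt.Errorf("no JSON object in classifier reply")
	}
	var v verdict
	if err := json.Unmarshal([]byte(reply[i:j+1]), &v); err != nil {
		return "", err
	}
	// Unknown labels (including "normal") fall to the safe floor.
	route := "grok_direct"
	switch strings.ToLower(strings.TrimSpace(v.Route)) {
	case "trivial":
		route = "trivial"
	case "web":
		route = "web_then_grok"
	}
	// A low-confidence escalation off the floor is treated as doubt.
	if route != "grok_direct" && v.Confidence < confidenceFloor {
		return "grok_direct", nil
	}
	return route, nil
}

func main() {
	replies := []string{
		`{"route":"trivial","confidence":0.5,"needs_web":false}`,
		`{"route":"web","confidence":0.9,"needs_web":true}`,
		"no json here",
	}
	for _, r := range replies {
		route, err := decide(r)
		fmt.Printf("%q -> %q (err=%v)\n", r, route, err)
	}
}
```

In this sketch only a parse failure surfaces as an error; every other doubtful case resolves to the floor, which matches the degrade behaviour described in the comments above.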


@ -0,0 +1,81 @@
package main
import "testing"
// TestRouteLayer0 is the heuristic golden set (RU+EN). The critical property is the
// safe floor: anything substantive must land on grok_direct, and a long message that
// merely starts with a greeting must NOT be trivial (no leaking real questions to the
// cheap model, §8.6).
func TestRouteLayer0(t *testing.T) {
cases := []struct {
body string
want string
}{
// trivial: short greetings/acks and bare arithmetic
{"привет", routeTrivial},
{"Привет!", routeTrivial},
{"спасибо", routeTrivial},
{"спс", routeTrivial},
{"ок", routeTrivial},
{"hi", routeTrivial},
{"hello", routeTrivial},
{"thanks", routeTrivial},
{"thank you", routeTrivial},
{"ok", routeTrivial},
{"2+2", routeTrivial},
{"100 * 50", routeTrivial},
{"12 / 4 - 1", routeTrivial},
// web: freshness signals
{"какие новости сегодня?", routeWebThenGrok},
{"что сейчас происходит в мире", routeWebThenGrok},
{"курс доллара сегодня", routeWebThenGrok},
{"what's the weather today", routeWebThenGrok},
{"latest news on AI", routeWebThenGrok},
{"current bitcoin price", routeWebThenGrok},
// grok_direct: substantive (the safe floor)
{"посоветуй фильм на вечер", routeGrokDirect},
{"explain how TCP works", routeGrokDirect},
{"расскажи историю римской империи", routeGrokDirect},
{"спасибо, а теперь подробно объясни квантовую запутанность", routeGrokDirect}, // starts w/ ack but long
{"hi, can you help me debug this Go program?", routeGrokDirect}, // starts w/ hi but a real ask
{"напиши функцию сортировки на python", routeGrokDirect},
}
for _, c := range cases {
if got := routeLayer0(c.body).Route; got != c.want {
t.Errorf("routeLayer0(%q) = %q, want %q", c.body, got, c.want)
}
}
}
func TestNormalizeRoute(t *testing.T) {
cases := map[string]string{
"trivial": routeTrivial, "web": routeWebThenGrok, "reason": routeReason,
"normal": routeGrokDirect, "garbage": routeGrokDirect, "": routeGrokDirect,
}
for in, want := range cases {
if got := normalizeRoute(in); got != want {
t.Errorf("normalizeRoute(%q) = %q, want %q", in, got, want)
}
}
}
func TestExtractJSON(t *testing.T) {
if got := extractJSON("prefix {\"route\":\"web\"} suffix"); got != `{"route":"web"}` {
t.Errorf("extractJSON = %q", got)
}
if got := extractJSON("no json here"); got != "" {
t.Errorf("extractJSON(no json) = %q, want empty", got)
}
}
func TestContainsTrigger(t *testing.T) {
if !containsTrigger("ну подумай глубже про это", "подумай глубже") {
t.Error("should match trigger phrase mid-sentence")
}
if containsTrigger("just a normal question", "подумай глубже") {
t.Error("must not match when phrase absent")
}
if containsTrigger("anything", "") {
t.Error("empty trigger must never match")
}
}
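
One property the golden set above does not exercise: freshnessRe is unanchored and has no word boundaries, so it matches inside longer words. The sketch below copies the pattern verbatim from router.go and probes it; `looksFresh` is a hypothetical helper that reproduces only the freshness check of routeLayer0 (lowercase, trim, match). Whether substring matches such as "score" inside "underscore" are intended is not stated in the source; the router comments only note that a false web route degrades cleanly.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// freshnessRe is copied verbatim from router.go.
var freshnessRe = regexp.MustCompile(`(новост|сегодня|сейчас|последн|курс\s|погод|котировк|расписани|прогноз|breaking|today|right now|latest|current(ly)?|news|weather|stock price|exchange rate|score)`)

// looksFresh is a hypothetical helper: it applies the same normalisation as
// routeLayer0 and reports whether the freshness pattern matches anywhere.
func looksFresh(body string) bool {
	return freshnessRe.MatchString(strings.ToLower(strings.TrimSpace(body)))
}

func main() {
	for _, body := range []string{
		"Latest news on AI",
		"explain how TCP works",
		"explain the underscore convention in Go",
	} {
		fmt.Printf("%q fresh=%v\n", body, looksFresh(body))
	}
}
```

If such matches turn out to be unwanted, adding a case like the third one to the table test would pin the behaviour either way before the pattern is tightened.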


@ -2,6 +2,7 @@ package main
 import (
 "context"
+"encoding/json"
 "errors"
 "fmt"
 "time"
@ -107,6 +108,64 @@ var migrations = []string{
 PRIMARY KEY (date, mxid)
 );
 CREATE TABLE IF NOT EXISTS warned_encrypted (room_id TEXT PRIMARY KEY);`,
+// v2: component cost columns + the optimistic reservation column. `reserved_usd`
+// holds the estimated max-cost of in-flight calls so the global ceiling counts
+// committed + reserved spend at admission time (the TOCTOU fix, §8.1): without it
+// a burst of concurrent calls all read the same low committed SUM and slip past
+// the ceiling, because the USD only lands at settle, AFTER the call. The component
+// columns let the ceiling see grounding/tool fees too (not just tokens), and feed
+// the per-component analytics. ADD COLUMN IF NOT EXISTS is idempotent.
+`ALTER TABLE spend ADD COLUMN IF NOT EXISTS reserved_usd DOUBLE PRECISION NOT NULL DEFAULT 0;
+ALTER TABLE spend ADD COLUMN IF NOT EXISTS router_usd DOUBLE PRECISION NOT NULL DEFAULT 0;
+ALTER TABLE spend ADD COLUMN IF NOT EXISTS grounding_usd DOUBLE PRECISION NOT NULL DEFAULT 0;
+ALTER TABLE spend ADD COLUMN IF NOT EXISTS webtool_usd DOUBLE PRECISION NOT NULL DEFAULT 0;`,
+// v3: request_log — one row per engaged request, for offline analysis of the route
+// mix, per-component $/day, latency, escalation/degrade rates (§6.2). Operational,
+// not message content: query_text is written ONLY when TELEMETRY_STORE_TEXT is on.
+// Indexed by ts for the time-based retention trim and time-series queries.
+`CREATE TABLE IF NOT EXISTS request_log (
+id TEXT PRIMARY KEY,
+ts TIMESTAMPTZ NOT NULL DEFAULT now(),
+room_id TEXT,
+sender TEXT,
+route TEXT,
+router_source TEXT,
+router_confidence REAL,
+models JSONB,
+prompt_tokens INT,
+cached_tokens INT,
+completion_tokens INT,
+token_usd DOUBLE PRECISION,
+grounding_usd DOUBLE PRECISION,
+router_usd DOUBLE PRECISION,
+webtool_usd DOUBLE PRECISION,
+total_usd DOUBLE PRECISION,
+latency_ms INT,
+stage_ms JSONB,
+escalated BOOL DEFAULT false,
+fallback_fired BOOL DEFAULT false,
+cache_hit BOOL DEFAULT false,
+ceiling_hit BOOL DEFAULT false,
+per_user_cap_hit BOOL DEFAULT false,
+prompt_version TEXT,
+provider_request_id TEXT,
+degraded TEXT DEFAULT '',
+err TEXT DEFAULT '',
+ok BOOL,
+query_text TEXT
+);
+CREATE INDEX IF NOT EXISTS request_log_ts_idx ON request_log (ts);`,
+// v4: per-day grounded-prompt counter for the web grounding cap guard (§8.2.3). One
+// row per UTC day; the cap check + increment is one atomic statement (same TOCTOU
+// discipline as the spend gate), so a burst can't blow past the $/1k grounding
+// overage. Day-keyed, so it self-resets and needs no separate trim.
+`CREATE TABLE IF NOT EXISTS grounding_count (
+date TEXT PRIMARY KEY,
+n INTEGER NOT NULL DEFAULT 0
+);`,
 }
 // migrate runs all pending migrations on a single connection under a session
@ -207,13 +266,19 @@ func (s *Store) SeenEvent(eventID string) (bool, error) {
 return true, err
 }
-// SpentTodayUSD sums all spend for the current UTC day. SUM over no rows is NULL,
-// which scans into a nil *float64 → treated as 0.
+// committedUSDExpr sums every COMMITTED cost component of a spend row — tokens plus
+// the grounding/web/router fees a cascade can incur — so the wallet ceiling is never
+// blind to non-token spend. It deliberately excludes reserved_usd (that is in-flight,
+// not yet spent); the admission gate adds reserved separately.
+const committedUSDExpr = `usd + router_usd + grounding_usd + webtool_usd`
+// SpentTodayUSD sums all COMMITTED spend for the current UTC day. SUM over no rows is
+// NULL, which scans into a nil *float64 → treated as 0.
 func (s *Store) SpentTodayUSD() (float64, error) {
 ctx, cancel := opContext()
 defer cancel()
 var v *float64
-if err := s.pool.QueryRow(ctx, `SELECT SUM(usd) FROM spend WHERE date = $1`, todayUTC()).Scan(&v); err != nil {
+if err := s.pool.QueryRow(ctx, `SELECT SUM(`+committedUSDExpr+`) FROM spend WHERE date = $1`, todayUTC()).Scan(&v); err != nil {
 return 0, err
 }
 if v == nil {
@ -222,19 +287,30 @@ func (s *Store) SpentTodayUSD() (float64, error) {
 return *v, nil
 }
-// Reserve runs the two independent gates in one transaction, BEFORE the xAI call
-// (F4): the global USD ceiling protects the wallet; the per-user request cap is
-// anti-abuse. It increments the per-user request count on success; the USD is
-// reconciled after the response. Order: global first (cheapest to deny), then
-// per-user.
+// reserveDayLockKey namespaces the per-day admission lock so it can't collide with
+// the migration lock or any other advisory lock.
+const reserveDayLockKey = "ai-bot:reserve:"
+// Reserve runs the two admission gates in one transaction, BEFORE the call (F4): the
+// global USD ceiling protects the wallet; the per-user request cap is anti-abuse. On
+// success it both increments the per-user request count AND books `estimate` (the
+// route's max-cost) into reserved_usd, so the global gate counts committed + reserved
+// spend. The actual USD is settled after the response (Settle), at which point the
+// reservation is released and the real cost booked. Order: global first (cheapest to
+// deny), then per-user.
 //
-// A transaction-scoped advisory lock on (date, mxid) serializes concurrent
-// reservations for the SAME user+day, so the per-user check-then-increment stays
-// atomic. The former SQLite store got this for free (one connection serialized all
-// callers); the pgx pool is concurrent, and the same user messaging from two rooms
-// at once would otherwise be able to slip past the per-user cap. Different users
-// never contend.
-func (s *Store) Reserve(mxid string, perUserCap int, dailyUSDCeiling float64) (reserveResult, error) {
+// The check-and-reserve is serialized GLOBALLY for the day by a transaction-scoped
+// advisory lock keyed on the date (not on date|mxid as the bare port did). This is
+// the TOCTOU fix (§8.1): the ceiling reads SUM(committed)+SUM(reserved) and then adds
+// its own reservation atomically, so a burst of DIFFERENT users can overshoot the
+// ceiling by at most ONE max-reservation rather than slipping through unbounded — the
+// per-(date,mxid) lock only serialized one user with himself and left the cross-user
+// ceiling unprotected. The former SQLite store serialized ALL callers on its single
+// connection anyway, so this restores that exact admission semantics, durably; the
+// bot is low-volume with per-room single-flight, so a per-day admission lock costs
+// nothing observable. Settle/Release run lock-free (they only release/convert spend,
+// never admit).
+func (s *Store) Reserve(mxid string, perUserCap int, perUserUSD, dailyUSDCeiling, estimate float64) (reserveResult, error) {
 ctx, cancel := opContext()
 defer cancel()
 day := todayUTC()
@ -245,41 +321,50 @@ func (s *Store) Reserve(mxid string, perUserCap int, dailyUSDCeiling float64) (r
 }
 defer tx.Rollback(ctx)
-// Key on date|mxid. The separator only needs to avoid cross-key ambiguity; a
-// hash collision would merely over-serialize two unrelated users, never corrupt a
-// count. (NUL is rejected by Postgres text, so use a printable separator.)
-if _, err := tx.Exec(ctx, `SELECT pg_advisory_xact_lock(hashtextextended($1, 0))`, day+"|"+mxid); err != nil {
+if _, err := tx.Exec(ctx, `SELECT pg_advisory_xact_lock(hashtextextended($1, 0))`, reserveDayLockKey+day); err != nil {
 return reserveOK, err
 }
-// SUM over zero rows is NULL → nil pointer → treat as 0.0, exactly as the SQLite
-// store's sql.NullFloat64 did (and as SpentTodayUSD does). This keeps the gate 1:1
-// even at the degenerate dailyUSDCeiling == 0 (deny everything), where 0 >= 0.
-var global *float64
-if err := tx.QueryRow(ctx, `SELECT SUM(usd) FROM spend WHERE date = $1`, day).Scan(&global); err != nil {
+// committed + reserved. SUM over zero rows is NULL → nil pointer → treat as 0.0,
+// exactly as the SQLite store's sql.NullFloat64 did. This keeps the gate 1:1 even
+// at the degenerate dailyUSDCeiling == 0 (deny everything), where 0 >= 0.
+var inFlight *float64
+if err := tx.QueryRow(ctx,
+`SELECT SUM(`+committedUSDExpr+` + reserved_usd) FROM spend WHERE date = $1`, day).Scan(&inFlight); err != nil {
 return reserveOK, err
 }
 spentToday := 0.0
-if global != nil {
-spentToday = *global
+if inFlight != nil {
+spentToday = *inFlight
 }
 if spentToday >= dailyUSDCeiling {
 return reserveDeniedGlobal, nil
 }
+// Per-user row: read requests AND the user's own committed+reserved $ in one go, so
+// both per-user gates are checked under the same lock. ErrNoRows → first request of
+// the day for this user → all zero.
 var requests int
-err = tx.QueryRow(ctx, `SELECT requests FROM spend WHERE date = $1 AND mxid = $2`, day, mxid).Scan(&requests)
+var userUSD float64
+err = tx.QueryRow(ctx,
+`SELECT requests, `+committedUSDExpr+` + reserved_usd FROM spend WHERE date = $1 AND mxid = $2`,
+day, mxid).Scan(&requests, &userUSD)
 if err != nil && !errors.Is(err, pgx.ErrNoRows) {
 return reserveOK, err
 }
 if requests >= perUserCap {
 return reserveDeniedUser, nil
 }
+// Optional per-user $ quota (0 = off): keep one user from draining the shared ceiling.
+if perUserUSD > 0 && userUSD >= perUserUSD {
+return reserveDeniedUser, nil
+}
 if _, err := tx.Exec(ctx,
-`INSERT INTO spend (date, mxid, requests, usd) VALUES ($1, $2, 1, 0)
-ON CONFLICT (date, mxid) DO UPDATE SET requests = spend.requests + 1`,
-day, mxid); err != nil {
+`INSERT INTO spend (date, mxid, requests, reserved_usd) VALUES ($1, $2, 1, $3)
+ON CONFLICT (date, mxid) DO UPDATE SET requests = spend.requests + 1,
+reserved_usd = spend.reserved_usd + excluded.reserved_usd`,
+day, mxid, estimate); err != nil {
 return reserveOK, err
 }
 if err := tx.Commit(ctx); err != nil {
@ -288,10 +373,11 @@ func (s *Store) Reserve(mxid string, perUserCap int, dailyUSDCeiling float64) (r
 return reserveOK, nil
 }
-// RefundRequest gives back a reserved request slot when the call ultimately
-// failed (e.g. an xAI outage), so a transient failure doesn't burn the user's
-// daily cap. Never drops below zero. A single UPDATE is atomic, so concurrent
-// refunds settle correctly without extra locking.
+// RefundRequest gives back a reserved request SLOT when the call ultimately failed
+// (an outage) or the reply couldn't be delivered (paid silence, §8.1), so a transient
+// failure doesn't burn the user's daily cap. It does NOT touch USD: a 2xx is really
+// billed even if we then fail to deliver. Never drops below zero. A single UPDATE is
+// atomic, so concurrent refunds settle correctly without extra locking.
 func (s *Store) RefundRequest(mxid string) error {
 ctx, cancel := opContext()
 defer cancel()
@ -301,19 +387,128 @@ func (s *Store) RefundRequest(mxid string) error {
 return err
 }
-// Reconcile books the actual USD cost of a completed call against the user's
-// daily row (and thus the global total). The accumulating upsert is atomic and
-// commutative, so concurrent reconciles for the same user sum correctly.
-func (s *Store) Reconcile(mxid string, usd float64) error {
+// ReleaseReservation frees a reservation whose request produced no billable spend,
+// restoring the global headroom without booking anything. The normal failure paths
+// settle via Settle (which also releases), so this is the safety valve for an
+// UNSETTLED exit — a panic in generation, recovered by safego — where it runs with
+// RefundRequest in respond's deferred guard so a leaked reservation can't drift the
+// ceiling. GREATEST(0, …) guards against a double-release driving reserved_usd
+// negative. Lock-free: it only lowers the in-flight reserved total, never admits.
+func (s *Store) ReleaseReservation(mxid string, estimate float64) error {
 ctx, cancel := opContext()
 defer cancel()
 _, err := s.pool.Exec(ctx,
-`INSERT INTO spend (date, mxid, requests, usd) VALUES ($1, $2, 0, $3)
-ON CONFLICT (date, mxid) DO UPDATE SET usd = spend.usd + excluded.usd`,
-todayUTC(), mxid, usd)
+`UPDATE spend SET reserved_usd = GREATEST(0, reserved_usd - $3) WHERE date = $1 AND mxid = $2`,
+todayUTC(), mxid, estimate)
 return err
 }
+// Settle releases a call's reservation and books its ACTUAL cost in one atomic step
+// (replacing the old additive Reconcile): reserved_usd drops by the reservation while
+// the real per-component cost is added to the committed columns. This is non-additive
+// on the reservation (settle, not accumulate), the semantics the ceiling needs. It is
+// also the partial-cascade-refund primitive (§8.1): a web_then_grok call that paid
+// grounding but failed at the final model passes a CostBreakdown carrying only the
+// grounding it actually spent, releases the rest of the reservation, and refunds the
+// request slot separately. GREATEST(0, …) keeps reserved_usd from underflowing.
+// Atomic and commutative per row, so concurrent settles for one user sum correctly.
+func (s *Store) Settle(mxid string, estimate float64, cost CostBreakdown) error {
+ctx, cancel := opContext()
+defer cancel()
+_, err := s.pool.Exec(ctx,
+`INSERT INTO spend (date, mxid, requests, usd, router_usd, grounding_usd, webtool_usd, reserved_usd)
+VALUES ($1, $2, 0, $3, $4, $5, $6, 0)
+ON CONFLICT (date, mxid) DO UPDATE SET
+usd = spend.usd + excluded.usd,
+router_usd = spend.router_usd + excluded.router_usd,
+grounding_usd = spend.grounding_usd + excluded.grounding_usd,
+webtool_usd = spend.webtool_usd + excluded.webtool_usd,
+reserved_usd = GREATEST(0, spend.reserved_usd - $7)`,
+todayUTC(), mxid, cost.Token, cost.Router, cost.Grounding, cost.WebTool, estimate)
+return err
+}
+// InsertRequestLog writes one analytics row. id is the event id (PRIMARY KEY), so a
+// re-logged event is a no-op (ON CONFLICT DO NOTHING) — each event takes exactly one
+// terminal path, so this never overwrites a real outcome. The write is isolated: the
+// caller runs it off the answer path and only logs a failure, never drops the reply.
+func (s *Store) InsertRequestLog(rl RequestLog) error {
+ctx, cancel := opContext()
+defer cancel()
+models, err := json.Marshal(rl.Models)
+if err != nil {
+return err
+}
+stages, err := json.Marshal(rl.StageMS)
+if err != nil {
+return err
+}
+// query_text is NULL unless text capture is on (the struct carries "" otherwise),
+// so the analytics table never holds message content by default.
+var queryText any
+if rl.QueryText != "" {
+queryText = rl.QueryText
+}
+_, err = s.pool.Exec(ctx, `
+INSERT INTO request_log (
+id, room_id, sender, route, router_source, router_confidence, models,
+prompt_tokens, cached_tokens, completion_tokens,
+token_usd, grounding_usd, router_usd, webtool_usd, total_usd,
+latency_ms, stage_ms, escalated, fallback_fired, cache_hit, ceiling_hit,
+per_user_cap_hit, prompt_version, provider_request_id, degraded, err, ok, query_text
+) VALUES (
+$1, $2, $3, $4, $5, $6, $7,
+$8, $9, $10,
+$11, $12, $13, $14, $15,
+$16, $17, $18, $19, $20, $21,
+$22, $23, $24, $25, $26, $27, $28
+) ON CONFLICT (id) DO NOTHING`,
+rl.ID, rl.RoomID, rl.Sender, rl.Route, rl.RouterSource, rl.RouterConfidence, models,
+rl.PromptTokens, rl.CachedTokens, rl.CompletionTokens,
+rl.Cost.Token, rl.Cost.Grounding, rl.Cost.Router, rl.Cost.WebTool, rl.Cost.Total(),
+rl.LatencyMS, stages, rl.Escalated, rl.FallbackFired, rl.CacheHit, rl.CeilingHit,
+rl.PerUserCapHit, rl.PromptVersion, rl.ProviderRequestID, rl.Degraded, rl.Err, rl.OK, queryText)
+return err
+}
+// TrimRequestLog deletes analytics rows older than the cutoff (time-based, since the
+// data is a time series — unlike the count-bounded dedup tables). A no-op for a zero
+// cutoff. Cheap given the ts index.
+func (s *Store) TrimRequestLog(olderThan time.Time) error {
+ctx, cancel := opContext()
+defer cancel()
+_, err := s.pool.Exec(ctx, `DELETE FROM request_log WHERE ts < $1`, olderThan)
+return err
+}
+// IncrGroundingIfUnder atomically admits one grounded prompt for today if the day's
+// count is below cap, returning whether it was admitted. The check-and-increment is a
+// single statement, so concurrent grounding calls can't race past the cap and into the
+// per-1k overage (§8.2.3). A non-positive cap denies everything (grounding effectively
+// off). The counter is day-keyed and self-resets at UTC midnight.
+func (s *Store) IncrGroundingIfUnder(cap int) (bool, error) {
+if cap <= 0 {
+return false, nil
+}
+ctx, cancel := opContext()
+defer cancel()
+var n int
+err := s.pool.QueryRow(ctx, `
+INSERT INTO grounding_count (date, n) VALUES ($1, 1)
+ON CONFLICT (date) DO UPDATE SET n = grounding_count.n + 1
+WHERE grounding_count.n < $2
+RETURNING n`, todayUTC(), cap).Scan(&n)
+if errors.Is(err, pgx.ErrNoRows) {
+return false, nil // at/over cap — the conflict update was filtered out
+}
+if err != nil {
+return false, err
+}
+return true, nil
+}
 // HasWarnedEncrypted / SetWarnedEncrypted persist the one-shot "reacted 🔒 to this
 // room because I can't read encryption" flag so a restart doesn't re-react on every
 // message (F5). The bot never reacts to its own events: m.reaction is not an
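
The admission rule described for Reserve and Settle (check committed plus reserved under one lock, then add the new reservation) can be modelled in memory to see the bound it gives. The sketch below is an illustration only, not the Postgres implementation: a `sync.Mutex` stands in for the per-day advisory lock, a single `ledger` struct stands in for the summed `spend` rows, and the names `ledger`, `reserve`, `settle`, and `admitBurst` are hypothetical.

```go
package main

import (
	"fmt"
	"sync"
)

// ledger is a hypothetical in-memory model of the day's spend: committed is
// settled cost, reserved is the estimated cost of calls still in flight.
type ledger struct {
	mu        sync.Mutex // stands in for the per-day advisory lock
	committed float64
	reserved  float64
}

// reserve admits a call only if committed+reserved is still under the ceiling,
// and books the estimate in the same critical section.
func (l *ledger) reserve(ceiling, estimate float64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.committed+l.reserved >= ceiling {
		return false
	}
	l.reserved += estimate
	return true
}

// settle releases the reservation and books the actual cost, clamping the
// reserved total at zero as GREATEST(0, ...) does in the SQL.
func (l *ledger) settle(estimate, actual float64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.committed += actual
	l.reserved -= estimate
	if l.reserved < 0 {
		l.reserved = 0
	}
}

// admitBurst fires n concurrent reservations and reports how many were
// admitted and how much ended up reserved.
func admitBurst(n int, ceiling, estimate float64) (int, float64) {
	l := &ledger{}
	var wg sync.WaitGroup
	var mu sync.Mutex
	admitted := 0
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if l.reserve(ceiling, estimate) {
				mu.Lock()
				admitted++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return admitted, l.reserved
}

func main() {
	admitted, reserved := admitBurst(50, 1.0, 0.75)
	fmt.Println(admitted, reserved)
}
```

Because the check and the booking share one critical section, the number admitted does not depend on goroutine scheduling: with a ceiling of 1.0 and an estimate of 0.75, two calls are admitted and the total exceeds the ceiling by less than one estimate, which is the bound the Reserve comment states.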


@ -1,9 +1,11 @@
 package main
 import (
+"fmt"
 "sync"
 "sync/atomic"
 "testing"
+"time"
 )
 // These tests exercise the Postgres-backed store directly. They run only when
@ -84,23 +86,23 @@ func TestStoreLimiterPerUserCap(t *testing.T) {
 const cap, ceiling = 2, 100.0
 for i := 0; i < cap; i++ {
-if res, err := st.Reserve(user, cap, ceiling); err != nil || res != reserveOK {
+if res, err := st.Reserve(user, cap, 0, ceiling, 0); err != nil || res != reserveOK {
 t.Fatalf("reserve %d: got (%v,%v), want reserveOK", i, res, err)
 }
 }
 // The (cap+1)th request is denied per-user.
-if res, err := st.Reserve(user, cap, ceiling); err != nil || res != reserveDeniedUser {
+if res, err := st.Reserve(user, cap, 0, ceiling, 0); err != nil || res != reserveDeniedUser {
 t.Fatalf("over-cap reserve: got (%v,%v), want reserveDeniedUser", res, err)
 }
 // A different user is unaffected.
-if res, err := st.Reserve("@v:vojo.chat", cap, ceiling); err != nil || res != reserveOK {
+if res, err := st.Reserve("@v:vojo.chat", cap, 0, ceiling, 0); err != nil || res != reserveOK {
 t.Fatalf("other user reserve: got (%v,%v), want reserveOK", res, err)
 }
 // Refund returns a slot, so the first user can reserve once more.
 if err := st.RefundRequest(user); err != nil {
 t.Fatalf("refund: %v", err)
 }
-if res, err := st.Reserve(user, cap, ceiling); err != nil || res != reserveOK {
+if res, err := st.Reserve(user, cap, 0, ceiling, 0); err != nil || res != reserveOK {
 t.Fatalf("post-refund reserve: got (%v,%v), want reserveOK", res, err)
 }
 }
@ -110,7 +112,7 @@ func TestStoreLimiterPerUserCap(t *testing.T) {
 func TestStoreLimiterZeroCap(t *testing.T) {
 st := openTestStore(t)
 defer st.Close()
-if res, err := st.Reserve("@u:vojo.chat", 0, 100.0); err != nil || res != reserveDeniedUser {
+if res, err := st.Reserve("@u:vojo.chat", 0, 0, 100.0, 0); err != nil || res != reserveDeniedUser {
 t.Fatalf("zero-cap reserve: got (%v,%v), want reserveDeniedUser", res, err)
 }
 }
@ -121,7 +123,7 @@ func TestStoreLimiterZeroCap(t *testing.T) {
 func TestStoreLimiterZeroCeiling(t *testing.T) {
 st := openTestStore(t)
 defer st.Close()
-if res, err := st.Reserve("@u:vojo.chat", 1_000_000, 0); err != nil || res != reserveDeniedGlobal {
+if res, err := st.Reserve("@u:vojo.chat", 1_000_000, 0, 0, 0); err != nil || res != reserveDeniedGlobal {
 t.Fatalf("zero-ceiling reserve on empty store: got (%v,%v), want reserveDeniedGlobal", res, err)
 }
 }
@ -131,18 +133,18 @@ func TestStoreLimiterGlobalCeiling(t *testing.T) {
defer st.Close()
const ceiling = 1.0
-// Book spend up to the ceiling (Reconcile is what feeds the global gate).
+// Book spend up to the ceiling (Settle is what feeds the global gate).
-if err := st.Reconcile("@a:vojo.chat", 0.6); err != nil {
+if err := st.Settle("@a:vojo.chat", 0, CostBreakdown{Token: 0.6}); err != nil {
-t.Fatalf("reconcile a: %v", err)
+t.Fatalf("settle a: %v", err)
}
-if err := st.Reconcile("@b:vojo.chat", 0.5); err != nil {
+if err := st.Settle("@b:vojo.chat", 0, CostBreakdown{Token: 0.5}); err != nil {
-t.Fatalf("reconcile b: %v", err)
+t.Fatalf("settle b: %v", err)
}
if spent, err := st.SpentTodayUSD(); err != nil || spent < 1.1 {
t.Fatalf("spent today: got (%v,%v), want >= 1.1", spent, err)
}
// Now any reservation is denied globally, regardless of the per-user cap.
-if res, err := st.Reserve("@c:vojo.chat", 1_000_000, ceiling); err != nil || res != reserveDeniedGlobal {
+if res, err := st.Reserve("@c:vojo.chat", 1_000_000, 0, ceiling, 0); err != nil || res != reserveDeniedGlobal {
t.Fatalf("over-ceiling reserve: got (%v,%v), want reserveDeniedGlobal", res, err)
}
}
@ -165,7 +167,7 @@ func TestStoreReserveConcurrentRespectsCap(t *testing.T) {
wg.Add(1)
go func() {
defer wg.Done()
-res, err := st.Reserve(user, cap, 1e9)
+res, err := st.Reserve(user, cap, 0, 1e9, 0)
if err != nil {
t.Errorf("reserve: %v", err)
return
@ -181,6 +183,266 @@ func TestStoreReserveConcurrentRespectsCap(t *testing.T) {
}
}
// TestStoreReserveConcurrentCeilingBounded is the §8.1 TOCTOU regression. Many
// DIFFERENT users reserving at once against a low ceiling must not overshoot it by
// more than ONE max-reservation. The bare pgx port's per-(date,mxid) lock left the
// cross-user ceiling unprotected: every user read the same committed SUM(usd)=0 (the
// USD only lands at settle, after the call) and slipped through, so all N were
// admitted. The per-day admission lock + reserved_usd here bound the overshoot.
// Run under -race.
func TestStoreReserveConcurrentCeilingBounded(t *testing.T) {
st := openTestStore(t)
defer st.Close()
const estimate = 1.0 // each in-flight call reserves $1
const ceiling = 10.0 // so the gate should admit ~10, not 100
const perUserCap = 1_000_000 // keep the per-user cap out of the way
const goroutines = 100
var ok int64
var wg sync.WaitGroup
for i := 0; i < goroutines; i++ {
wg.Add(1)
go func(n int) {
defer wg.Done()
user := fmt.Sprintf("@u%d:vojo.chat", n) // a DIFFERENT user each time
res, err := st.Reserve(user, perUserCap, 0, ceiling, estimate)
if err != nil {
t.Errorf("reserve: %v", err)
return
}
if res == reserveOK {
atomic.AddInt64(&ok, 1)
}
}(i)
}
wg.Wait()
// committed+reserved < ceiling admits; the last admit can push reserved to just
// under ceiling+estimate, so admitted ≤ ceiling/estimate + 1. The pre-fix code
// admitted all 100.
maxAdmit := int64(ceiling/estimate) + 1
if ok < 1 || ok > maxAdmit {
t.Fatalf("admitted %d different users, want in [1, %d] (ceiling + one max-reserve)", ok, maxAdmit)
}
// Nothing was settled, so committed spend is still 0 — the cap came purely from
// reservations, which is the whole point (the USD isn't known until after the call).
if spent, err := st.SpentTodayUSD(); err != nil || spent != 0 {
t.Fatalf("committed spend = (%v,%v), want 0 (only reservations held)", spent, err)
}
}
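The admission rule this test exercises (admit while committed+reserved is still below the ceiling, then hold the estimate until settle) can be sketched in memory. This is a minimal illustration under stated assumptions, not the store's Postgres implementation: `ledger`, `reserve`, `settle` and `admitBurst` are hypothetical names, and a `sync.Mutex` stands in for the per-day admission lock.

```go
package main

import (
	"fmt"
	"sync"
)

// ledger is a hypothetical in-memory stand-in for one day's spend row.
type ledger struct {
	mu        sync.Mutex
	committed float64 // USD booked by settled calls
	reserved  float64 // USD held by calls still in flight
}

// reserve admits while committed+reserved is below the ceiling, then holds
// the estimate. The check and the increment happen under one lock, so
// concurrent callers cannot all observe the same stale total.
func (l *ledger) reserve(ceiling, estimate float64) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.committed+l.reserved >= ceiling {
		return false
	}
	l.reserved += estimate
	return true
}

// settle releases the held estimate (clamped at zero) and books the actual cost.
func (l *ledger) settle(estimate, actual float64) {
	l.mu.Lock()
	defer l.mu.Unlock()
	l.reserved -= estimate
	if l.reserved < 0 {
		l.reserved = 0
	}
	l.committed += actual
}

// admitBurst fires n concurrent reservations and returns how many were admitted.
func admitBurst(l *ledger, n int, ceiling, estimate float64) int {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		admitted int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if l.reserve(ceiling, estimate) {
				mu.Lock()
				admitted++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return admitted
}

func main() {
	l := &ledger{}
	fmt.Println(admitBurst(l, 100, 10.0, 1.0)) // prints 10
}
```

With a $1 estimate and a $10 ceiling the sketch admits exactly ten callers; the test's `+ 1` slack covers estimates that do not divide the ceiling evenly.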
// TestStoreSettleReleasesReservation verifies that Settle frees the reservation it
// books actual cost for, restoring global headroom — proven through the admission
// gate so it doesn't depend on reading the private column.
func TestStoreSettleReleasesReservation(t *testing.T) {
st := openTestStore(t)
defer st.Close()
const est = 5.0
const ceiling = 10.0
// Two reservations fill the ceiling (reserved 5 + 5 = 10); the third is denied.
if res, _ := st.Reserve("@a:vojo.chat", 1_000_000, 0, ceiling, est); res != reserveOK {
t.Fatalf("reserve a: %v", res)
}
if res, _ := st.Reserve("@b:vojo.chat", 1_000_000, 0, ceiling, est); res != reserveOK {
t.Fatalf("reserve b: %v", res)
}
if res, _ := st.Reserve("@c:vojo.chat", 1_000_000, 0, ceiling, est); res != reserveDeniedGlobal {
t.Fatalf("reserve c over full ceiling: got %v, want denied", res)
}
// Settle a with a small actual cost: reserved 10→5, committed 0→0.01. Headroom
// returns, so a new reservation is admitted again.
if err := st.Settle("@a:vojo.chat", est, CostBreakdown{Token: 0.01}); err != nil {
t.Fatalf("settle a: %v", err)
}
if res, _ := st.Reserve("@d:vojo.chat", 1_000_000, 0, ceiling, est); res != reserveOK {
t.Fatalf("reserve d after settle freed headroom: got %v, want reserveOK", res)
}
if spent, _ := st.SpentTodayUSD(); spent < 0.009 || spent > 0.011 {
t.Fatalf("committed after one settle = %v, want ~0.01", spent)
}
}
// TestStoreReleaseReservation verifies the call-failed path: a released reservation
// frees headroom and books no USD, and an over-release clamps reserved_usd to 0
// rather than going negative (a negative reservation would manufacture phantom
// headroom past the ceiling).
func TestStoreReleaseReservation(t *testing.T) {
st := openTestStore(t)
defer st.Close()
const est = 5.0
const ceiling = 10.0
// Reserve a, then over-release it by far more than it held.
if res, _ := st.Reserve("@a:vojo.chat", 1_000_000, 0, ceiling, est); res != reserveOK {
t.Fatalf("reserve a: %v", res)
}
if err := st.ReleaseReservation("@a:vojo.chat", 100); err != nil {
t.Fatalf("over-release: %v", err)
}
// a's reserved must now be 0 (not -95): exactly two more $5 reservations fit the
// $10 ceiling, and the third is denied. Were reserved negative, far more would slip
// through — so the deny at the third request proves both the headroom was freed and
// the clamp held.
if res, _ := st.Reserve("@b:vojo.chat", 1_000_000, 0, ceiling, est); res != reserveOK {
t.Fatalf("reserve b: %v", res)
}
if res, _ := st.Reserve("@c:vojo.chat", 1_000_000, 0, ceiling, est); res != reserveOK {
t.Fatalf("reserve c: %v", res)
}
if res, _ := st.Reserve("@d:vojo.chat", 1_000_000, 0, ceiling, est); res != reserveDeniedGlobal {
t.Fatalf("reserve d: got %v, want denied (reserved must have clamped to 0, not gone negative)", res)
}
// Nothing was ever settled, so committed spend stays 0 — release books no USD.
if spent, _ := st.SpentTodayUSD(); spent != 0 {
t.Fatalf("committed after release = %v, want 0 (a failed call bills nothing)", spent)
}
}
// TestStoreRequestLog covers the analytics row: total_usd is the component sum,
// query_text is NULL unless captured, re-inserting one id is a no-op, and the
// time-based trim removes old rows.
func TestStoreRequestLog(t *testing.T) {
st := openTestStore(t)
defer st.Close()
noText := RequestLog{
ID: "$ev-rl-1", RoomID: "!r:vojo.chat", Sender: "@u:vojo.chat",
Route: routeGrokDirect, RouterSource: "default",
Models: map[string]string{"final": "grok-x"},
Cost: CostBreakdown{Token: 0.01, Grounding: 0.02},
LatencyMS: 1234, StageMS: map[string]int{"final": 1200},
ProviderRequestID: "prov-1", OK: true, // QueryText empty → NULL
}
if err := st.InsertRequestLog(noText); err != nil {
t.Fatalf("insert: %v", err)
}
// Re-inserting the same id is a no-op (ON CONFLICT DO NOTHING), not an error.
if err := st.InsertRequestLog(noText); err != nil {
t.Fatalf("re-insert: %v", err)
}
withText := RequestLog{ID: "$ev-rl-2", Route: routeTrivial, OK: false, QueryText: "hello"}
if err := st.InsertRequestLog(withText); err != nil {
t.Fatalf("insert-with-text: %v", err)
}
ctx, cancel := opContext()
defer cancel()
var route string
var total float64
var ok bool
var qt *string
if err := st.pool.QueryRow(ctx,
`SELECT route, total_usd, ok, query_text FROM request_log WHERE id = $1`, noText.ID).
Scan(&route, &total, &ok, &qt); err != nil {
t.Fatalf("read row1: %v", err)
}
if route != routeGrokDirect || !ok {
t.Fatalf("row1 = (%q, ok=%v), want (grok_direct, true)", route, ok)
}
if d := total - 0.03; d > 1e-9 || d < -1e-9 {
t.Fatalf("row1 total_usd = %v, want 0.03 (token+grounding)", total)
}
if qt != nil {
t.Fatalf("row1 query_text = %q, want NULL when text capture off", *qt)
}
if err := st.pool.QueryRow(ctx, `SELECT query_text FROM request_log WHERE id = $1`, withText.ID).Scan(&qt); err != nil {
t.Fatalf("read row2: %v", err)
}
if qt == nil || *qt != "hello" {
t.Fatalf("row2 query_text = %v, want \"hello\"", qt)
}
// Trim with a cutoff one hour in the future, so both rows (ts < cutoff) are removed.
if err := st.TrimRequestLog(time.Now().Add(time.Hour)); err != nil {
t.Fatalf("trim: %v", err)
}
var count int
if err := st.pool.QueryRow(ctx, `SELECT count(*) FROM request_log`).Scan(&count); err != nil {
t.Fatalf("count: %v", err)
}
if count != 0 {
t.Fatalf("after trim count = %d, want 0", count)
}
}
// TestStorePerUserUSDCap covers the optional per-user $ quota: a user is denied once
// their own committed+reserved spend reaches the cap, other users are unaffected, and a
// zero cap disables the check.
func TestStorePerUserUSDCap(t *testing.T) {
st := openTestStore(t)
defer st.Close()
const user = "@u:vojo.chat"
const perUserUSD = 1.0
if err := st.Settle(user, 0, CostBreakdown{Token: 0.9}); err != nil {
t.Fatalf("settle: %v", err)
}
// $0.9 < $1.0 cap → admitted.
if res, err := st.Reserve(user, 1_000_000, perUserUSD, 1e9, 0); err != nil || res != reserveOK {
t.Fatalf("under per-user USD: (%v,%v), want reserveOK", res, err)
}
// Push the user over the cap.
if err := st.Settle(user, 0, CostBreakdown{Token: 0.5}); err != nil { // now $1.4
t.Fatalf("settle: %v", err)
}
if res, err := st.Reserve(user, 1_000_000, perUserUSD, 1e9, 0); err != nil || res != reserveDeniedUser {
t.Fatalf("over per-user USD: (%v,%v), want reserveDeniedUser", res, err)
}
// A different user is unaffected by the first user's spend.
if res, _ := st.Reserve("@v:vojo.chat", 1_000_000, perUserUSD, 1e9, 0); res != reserveOK {
t.Fatal("other user must be unaffected by the first user's per-user USD")
}
// perUserUSD == 0 disables the check entirely (the big spender is admitted again).
if res, _ := st.Reserve(user, 1_000_000, 0, 1e9, 0); res != reserveOK {
t.Fatal("perUserUSD=0 must disable the per-user $ cap")
}
}
// TestStoreGroundingCap covers the durable grounding cap guard: it admits up to the
// cap, then denies; a non-positive cap denies everything.
func TestStoreGroundingCap(t *testing.T) {
st := openTestStore(t)
defer st.Close()
const cap = 3
for i := 0; i < cap; i++ {
if ok, err := st.IncrGroundingIfUnder(cap); err != nil || !ok {
t.Fatalf("grounding %d: (%v,%v), want admitted", i, ok, err)
}
}
if ok, err := st.IncrGroundingIfUnder(cap); err != nil || ok {
t.Fatalf("over-cap grounding: (%v,%v), want denied", ok, err)
}
if ok, _ := st.IncrGroundingIfUnder(0); ok {
t.Fatal("cap 0 must deny everything (grounding off)")
}
}
// TestStoreGroundingCapConcurrent: the atomic check-increment must admit EXACTLY cap
// under a concurrent burst, so a spike can't blow past the $/1k overage. Run under -race.
func TestStoreGroundingCapConcurrent(t *testing.T) {
st := openTestStore(t)
defer st.Close()
const cap = 10
const goroutines = 50
var ok int64
var wg sync.WaitGroup
for i := 0; i < goroutines; i++ {
wg.Add(1)
go func() {
defer wg.Done()
if a, err := st.IncrGroundingIfUnder(cap); err == nil && a {
atomic.AddInt64(&ok, 1)
}
}()
}
wg.Wait()
if ok != cap {
t.Fatalf("concurrent grounding admitted %d, want exactly %d", ok, cap)
}
}
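The "admit exactly cap under a burst" property depends on the check and the increment being one atomic step. A minimal in-memory analogue, assuming a compare-and-swap loop in place of whatever single SQL statement the store actually uses (`counter`, `incrIfUnder` and `burst` are hypothetical names):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// counter is a hypothetical stand-in for the durable daily grounding count.
type counter struct{ n atomic.Int64 }

// incrIfUnder increments only while the count is below limit. The CAS loop
// makes check+increment atomic: a racing caller that loses the swap re-reads
// the count and re-checks it against the limit.
func (c *counter) incrIfUnder(limit int64) bool {
	if limit <= 0 {
		return false // a non-positive cap denies everything
	}
	for {
		cur := c.n.Load()
		if cur >= limit {
			return false
		}
		if c.n.CompareAndSwap(cur, cur+1) {
			return true
		}
	}
}

// burst runs n concurrent attempts and returns how many were admitted.
func burst(c *counter, n int, limit int64) int64 {
	var admitted atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if c.incrIfUnder(limit) {
				admitted.Add(1)
			}
		}()
	}
	wg.Wait()
	return admitted.Load()
}

func main() {
	fmt.Println(burst(&counter{}, 50, 10)) // prints 10
}
```

A separate read followed by a write would let several goroutines pass the check on the same stale count, which is the overshoot the concurrent test guards against.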
func TestStoreWarnedEncrypted(t *testing.T) {
st := openTestStore(t)
const room = "!enc:vojo.chat"

124
apps/ai-bot/telemetry.go Normal file

@ -0,0 +1,124 @@
package main
import "time"
// telemetry.go is the request_log analytics path: it captures route, cost, latency
// and outcome for each engaged request so the real $/day and route mix can be
// MEASURED (the build plan's whole "is the cascade worth it" question) instead of
// modelled. It is strictly off the answer path — gated by TELEMETRY_ENABLED, written
// in a recovered goroutine, and a write failure only logs a WARN. A request never
// fails to be answered because telemetry couldn't be recorded.
// Route names (also the request_log.route values). grok_direct is today's path; the
// rest land behind flags in later phases. "none" means no model ran (a skip or a
// limiter denial).
const (
routeNone = "none"
routeGrokDirect = "grok_direct"
routeTrivial = "trivial_direct"
routeWebThenGrok = "web_then_grok"
routeReason = "reason_then_grok"
)
// Degrade/skip reason strings (request_log.degraded). Stable tokens so the analytics
// can GROUP BY them.
const (
degradeEncrypted = "encrypted_room"
degradeMedia = "media"
degradeForeign = "foreign_room"
degradeEmpty = "empty_completion"
degradeSendFailed = "send_failed"
degradeReserveErr = "reserve_error"
degradeRouter = "router_failed"
degradeWeb = "web_failed"
degradeTrivial = "trivial_failed"
degradeGroundCap = "grounding_cap"
degradeReasoning = "reasoning_failed"
)
// telemetryTrimEvery bounds how often the retention trim runs — once per N writes,
// off the hot path, so the analytics table stays time-bounded without a separate
// lifecycle or a DELETE on every insert.
const telemetryTrimEvery = 200
// RequestLog is one analytics row (the request_log columns). Zero values are the
// "didn't apply" case — a grok_direct request leaves the cascade fields zero.
type RequestLog struct {
ID string
RoomID string
Sender string
Route string
RouterSource string // heuristic|classifier|default|forced|degraded
RouterConfidence float64
Models map[string]string // {"router":"…","final":"…"}
PromptTokens int
CachedTokens int
CompletionTokens int
Cost CostBreakdown
LatencyMS int
StageMS map[string]int // {"router":12,"web":1400,"final":2100}
Escalated bool
FallbackFired bool
CacheHit bool
CeilingHit bool
PerUserCapHit bool
PromptVersion string
ProviderRequestID string
Degraded string
Err string
OK bool
QueryText string // stored only when TELEMETRY_STORE_TEXT; stripped otherwise
}
// recordTelemetry persists a row off the answer path. No-op unless TELEMETRY_ENABLED.
// The query text is stripped unless TELEMETRY_STORE_TEXT, so message content never
// lands in the analytics table by default. Runs in a recovered goroutine and only
// logs failures, so it can never drop or delay the reply.
func (b *Bot) recordTelemetry(rl RequestLog) {
if !b.cfg.TelemetryEnabled {
return
}
if !b.cfg.TelemetryStoreText {
rl.QueryText = ""
}
b.safego("telemetry", func() {
if err := b.st.InsertRequestLog(rl); err != nil {
b.log.Warn("request_log insert failed (non-fatal)", "id", rl.ID, "err", err)
}
b.maybeTrimTelemetry()
})
}
// recordSkip logs a request that addressed the bot but could not be served before
// any model ran (encrypted/media/foreign). These are low-frequency, so a direct row
// (route=none + reason) keeps the "why no answer" visible. The far more common
// not-addressed drops are deliberately not logged (pre-claim, best-effort), which
// keeps them from flooding the table.
func (b *Bot) recordSkip(ev *Event, reason string) {
b.recordTelemetry(RequestLog{
ID: ev.EventID,
RoomID: ev.RoomID,
Sender: ev.Sender,
Route: routeNone,
RouterSource: "default",
PromptVersion: b.promptVersion,
Degraded: reason,
OK: false,
})
}
// maybeTrimTelemetry runs the time-based retention trim once per telemetryTrimEvery
// writes. Best-effort and off the hot path (called from the telemetry goroutine).
func (b *Bot) maybeTrimTelemetry() {
if b.cfg.TelemetryRetention <= 0 {
return
}
if b.telemetryWrites.Add(1)%telemetryTrimEvery != 0 {
return
}
if err := b.st.TrimRequestLog(time.Now().Add(-b.cfg.TelemetryRetention)); err != nil {
b.log.Warn("request_log trim failed (non-fatal)", "err", err)
}
}
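The once-per-N trim gate above reduces to a shared atomic write counter and a modulo test. A minimal sketch, assuming an `atomic.Int64` counter like the `telemetryWrites` field appears to be (`trimGate`, `shouldTrim` and `countTrims` are hypothetical names):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// trimEvery mirrors telemetryTrimEvery from the diff.
const trimEvery = 200

// trimGate is a hypothetical holder for the shared write counter.
type trimGate struct{ writes atomic.Int64 }

// shouldTrim reports true on every trimEvery-th write. Add returns the new
// value, so exactly one caller observes each multiple even under concurrency.
func (g *trimGate) shouldTrim() bool {
	return g.writes.Add(1)%trimEvery == 0
}

// countTrims simulates a run of writes and counts how many would trigger a trim.
func countTrims(writes int) int {
	var g trimGate
	trims := 0
	for i := 0; i < writes; i++ {
		if g.shouldTrim() {
			trims++
		}
	}
	return trims
}

func main() {
	fmt.Println(countTrims(1000)) // prints 5
}
```

The counter lives in process memory, so under this scheme a restart resets it and the first trim after startup waits for the next multiple of N writes.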


@ -0,0 +1,71 @@
package main
import (
"io"
"log/slog"
"testing"
"time"
)
// newTestBot builds a Bot with just the fields the telemetry path needs — no network,
// so it sidesteps NewBot's identity check.
func newTestBot(st *Store, cfg *Config) *Bot {
return &Bot{cfg: cfg, st: st, log: slog.New(slog.NewTextHandler(io.Discard, nil)), promptVersion: "testv"}
}
func requestLogCount(t *testing.T, st *Store) int {
t.Helper()
ctx, cancel := opContext()
defer cancel()
var n int
if err := st.pool.QueryRow(ctx, `SELECT count(*) FROM request_log`).Scan(&n); err != nil {
t.Fatalf("count: %v", err)
}
return n
}
// TestRecordSkipWritesRow proves the early-return telemetry path actually records a
// row (route=none + the skip reason) when TELEMETRY_ENABLED is on. The write is async,
// so poll briefly.
func TestRecordSkipWritesRow(t *testing.T) {
st := openTestStore(t)
defer st.Close()
b := newTestBot(st, &Config{TelemetryEnabled: true})
ev := &Event{EventID: "$skip-1", RoomID: "!r:vojo.chat", Sender: "@u:vojo.chat"}
b.recordSkip(ev, degradeMedia)
deadline := time.Now().Add(2 * time.Second)
for requestLogCount(t, st) == 0 && time.Now().Before(deadline) {
time.Sleep(20 * time.Millisecond)
}
if n := requestLogCount(t, st); n != 1 {
t.Fatalf("telemetry rows = %d, want 1", n)
}
ctx, cancel := opContext()
defer cancel()
var route, degraded string
if err := st.pool.QueryRow(ctx,
`SELECT route, degraded FROM request_log WHERE id = $1`, ev.EventID).Scan(&route, &degraded); err != nil {
t.Fatalf("read: %v", err)
}
if route != routeNone || degraded != degradeMedia {
t.Fatalf("row = (%q,%q), want (none, media)", route, degraded)
}
}
// TestTelemetryDisabledWritesNothing proves the default (TELEMETRY_ENABLED off) adds
// no write path — strict "cascade-off == today".
func TestTelemetryDisabledWritesNothing(t *testing.T) {
st := openTestStore(t)
defer st.Close()
b := newTestBot(st, &Config{TelemetryEnabled: false})
b.recordSkip(&Event{EventID: "$skip-2", RoomID: "!r:vojo.chat", Sender: "@u:vojo.chat"}, degradeMedia)
// Give any (incorrect) async write time to land, then assert nothing was written.
time.Sleep(200 * time.Millisecond)
if n := requestLogCount(t, st); n != 0 {
t.Fatalf("telemetry rows = %d, want 0 (TELEMETRY_ENABLED off)", n)
}
}


@ -1,6 +1,17 @@
package main
-import "sync"
+import (
"hash/fnv"
"sync"
)
// hashString is a cheap, stable 32-bit hash (FNV-1a). Used for opaque, non-identifying
// derived ids (e.g. the prompt-cache conv id) — not for security.
func hashString(s string) uint32 {
h := fnv.New32a()
_, _ = h.Write([]byte(s))
return h.Sum32()
}
// lruSet is a bounded insertion-ordered string set used for event-id dedup and
// tracking our own sent event ids. Oldest entries evict once cap is reached.

223
apps/ai-bot/web.go Normal file

@ -0,0 +1,223 @@
package main
import (
"bytes"
"context"
"encoding/json"
"errors"
"fmt"
"io"
"log/slog"
"net/http"
)
// web.go is the pluggable web-freshness layer (Phase 3). A WebProvider fetches a
// grounded factual digest + source URLs for a query; the cascade then has Grok
// synthesise the final answer in voice from that digest. Two providers, chosen by
// WEB_PROVIDER:
//
// - grok_web_search (DEFAULT): the xAI Agent Tools `web_search` tool on the Responses
// API (/v1/responses). NB the older chat/completions Live Search `search_parameters`
// mechanism was RETIRED by xAI (now 410 Gone), and the web_search tool is not on
// chat/completions — hence the Responses endpoint. Billed $5/1k tool calls + tokens.
// - gemini_grounding: Gemini native v1beta google_search. Cheaper, but Gemini-3 only
// and silently ungrounds otherwise (F-EXT-3) — so it runs behind a citations
// verify-gate and degrades if it fails.
//
// The web call is bounded by a per-stage timeout (and gemini_grounding additionally by a
// durable daily cap), and either provider failing degrades the request to grok_direct
// with a staleness hedge (never silence, never stale-as-fresh).
//
// The grok_web_search Responses-API request/response shape was VALIDATED live against
// /v1/responses (2026-06-01): output[].type=="message" → content[].output_text + inline
// url_citation annotations; usage carries input/output tokens, cached subset, and the
// web_search_calls count (one request can search several times — each billed). The
// computed cost matched the API's own cost_in_usd_ticks to 4 dp. A parse miss still
// degrades safely (empty digest → grok_direct).
const (
webProviderGrokWebSearch = "grok_web_search"
webProviderGeminiGrounding = "gemini_grounding"
// grokWebSearchPerCall is xAI's Agent Tools fee: $5 per 1,000 web_search tool calls.
grokWebSearchPerCall = 5.0 / 1000.0
// maxWebSearchCalls bounds the per-call fee in the reservation envelope (one Responses
// request can search several times; the actual count is billed exactly at settle).
maxWebSearchCalls = 4
)
// errGroundingCapped signals the daily web/grounded-prompt cap was hit, so the caller
// degrades (with a hedge) rather than paying past the cap.
var errGroundingCapped = errors.New("web grounding daily cap reached")
// WebContext is the result of a web fetch: a factual digest to feed the final model,
// the sources behind it, the fetch's own token usage, and the cost the fetch incurred
// (kept separate from the final synthesis tokens so each books to its own ledger
// column). Cost is populated even when Digest is empty/failed, because the call was
// still billed — the caller books it before degrading (§8.1 partial cascade).
type WebContext struct {
Digest string
Citations []string
Usage Usage
Cost CostBreakdown
}
// WebProvider fetches grounded facts for a query. Stateless. It returns its cost in the
// WebContext even on error (the call was billed), and an error when the digest is
// unusable so the caller can degrade.
type WebProvider interface {
Fetch(ctx context.Context, query string) (WebContext, error)
}
// --- grok_web_search (default): xAI Agent Tools web_search on the Responses API -------
type grokWebSearch struct {
base string
key string
model string
cfg *Config
httpc *http.Client
logger *slog.Logger
}
func newGrokWebSearch(cfg *Config, logger *slog.Logger) *grokWebSearch {
return &grokWebSearch{
base: cfg.XAIBaseURL, key: cfg.XAIAPIKey, model: cfg.XAIModel,
cfg: cfg, httpc: &http.Client{}, logger: logger,
}
}
type grokResponsesRequest struct {
Model string `json:"model"`
Input string `json:"input"`
Tools []openAITool `json:"tools"`
// Keep the fetch fast/cheap when the operator runs a unified model with effort
// "none"; empty → not sent (provider default). Validated against /v1/responses.
ReasoningEffort string `json:"reasoning_effort,omitempty"`
}
// grokResponsesResponse maps the xAI Responses API shape (verified live 2026-06-01):
// output[] carries reasoning/web_search_call/message items; the message item's content
// has output_text (with inline url_citation annotations); usage reports tokens, the
// cached subset, and the count of server-side web_search calls (a single request can
// make several, each billed).
type grokResponsesResponse struct {
Output []struct {
Type string `json:"type"`
Content []struct {
Type string `json:"type"`
Text string `json:"text"`
Annotations []struct {
Type string `json:"type"`
URL string `json:"url"`
} `json:"annotations"`
} `json:"content"`
} `json:"output"`
Usage struct {
InputTokens int `json:"input_tokens"`
OutputTokens int `json:"output_tokens"`
InputTokensDetails struct {
CachedTokens int `json:"cached_tokens"`
} `json:"input_tokens_details"`
ServerSideToolUsageDetails struct {
WebSearchCalls int `json:"web_search_calls"`
} `json:"server_side_tool_usage_details"`
} `json:"usage"`
}
func (p *grokWebSearch) Fetch(ctx context.Context, query string) (WebContext, error) {
body, err := json.Marshal(grokResponsesRequest{
Model: p.model, Input: query, Tools: []openAITool{{Type: "web_search"}},
ReasoningEffort: p.cfg.GrokReasoningEffort,
})
if err != nil {
return WebContext{}, err
}
req, err := http.NewRequestWithContext(ctx, http.MethodPost, p.base+"/responses", bytes.NewReader(body))
if err != nil {
return WebContext{}, err
}
req.Header.Set("Content-Type", "application/json")
req.Header.Set("Authorization", "Bearer "+p.key)
resp, err := p.httpc.Do(req)
if err != nil {
return WebContext{}, err
}
defer resp.Body.Close()
data, _ := io.ReadAll(resp.Body)
if resp.StatusCode < 200 || resp.StatusCode >= 300 {
return WebContext{}, fmt.Errorf("grok web search http %d: %s", resp.StatusCode, snippet(data))
}
var out grokResponsesResponse
if err := json.Unmarshal(data, &out); err != nil {
return WebContext{}, fmt.Errorf("grok web search decode: %w", err)
}
var digest string
var citations []string
for _, item := range out.Output {
if item.Type != "message" {
continue
}
for _, c := range item.Content {
if c.Type == "output_text" {
digest += c.Text
}
for _, a := range c.Annotations {
if a.Type == "url_citation" && a.URL != "" {
citations = append(citations, a.URL)
}
}
}
}
usage := Usage{
PromptTokens: out.Usage.InputTokens,
CachedTokens: out.Usage.InputTokensDetails.CachedTokens,
CompletionTokens: out.Usage.OutputTokens,
}
// Cost = the call's tokens + the $5/1k fee times the ACTUAL number of web_search
// calls the request made (one request can search several times). Booked even when the
// digest is empty (the 2xx was billed), so the caller accounts for it before degrading.
// Cross-checked live against the API's own cost_in_usd_ticks — matched to 4 dp.
wc := WebContext{
Digest: digest,
Citations: citations,
Usage: usage,
Cost: CostBreakdown{
WebTool: computeUSD(p.model, usage, p.cfg) +
float64(out.Usage.ServerSideToolUsageDetails.WebSearchCalls)*grokWebSearchPerCall,
},
}
if digest == "" {
return wc, fmt.Errorf("grok web search: empty result")
}
return wc, nil
}
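The parse-and-price step of `Fetch` can be tried in isolation against a canned payload. This is a sketch under stated assumptions: the JSON below is an invented sample that follows the field names in `grokResponsesResponse`, not a captured API response; `extract` is a hypothetical helper; and the token cost (`computeUSD`) is left out so that only the per-call fee is shown.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// perCallUSD mirrors grokWebSearchPerCall: $5 per 1,000 web_search calls.
const perCallUSD = 5.0 / 1000.0

// responsesPayload is a trimmed copy of the fields the diff decodes.
type responsesPayload struct {
	Output []struct {
		Type    string `json:"type"`
		Content []struct {
			Type        string `json:"type"`
			Text        string `json:"text"`
			Annotations []struct {
				Type string `json:"type"`
				URL  string `json:"url"`
			} `json:"annotations"`
		} `json:"content"`
	} `json:"output"`
	Usage struct {
		ServerSideToolUsageDetails struct {
			WebSearchCalls int `json:"web_search_calls"`
		} `json:"server_side_tool_usage_details"`
	} `json:"usage"`
}

// sample is an illustrative payload, invented for this sketch.
const sample = `{
  "output": [
    {"type": "web_search_call"},
    {"type": "message", "content": [
      {"type": "output_text", "text": "Example digest.",
       "annotations": [{"type": "url_citation", "url": "https://example.com/a"}]}
    ]}
  ],
  "usage": {"server_side_tool_usage_details": {"web_search_calls": 3}}
}`

// extract collects output_text and url_citation URLs from message items and
// prices the search fee from the actual number of web_search calls.
func extract(raw []byte) (digest string, citations []string, feeUSD float64, err error) {
	var p responsesPayload
	if err = json.Unmarshal(raw, &p); err != nil {
		return "", nil, 0, err
	}
	for _, item := range p.Output {
		if item.Type != "message" {
			continue // skip reasoning / web_search_call items
		}
		for _, c := range item.Content {
			if c.Type == "output_text" {
				digest += c.Text
			}
			for _, a := range c.Annotations {
				if a.Type == "url_citation" && a.URL != "" {
					citations = append(citations, a.URL)
				}
			}
		}
	}
	feeUSD = float64(p.Usage.ServerSideToolUsageDetails.WebSearchCalls) * perCallUSD
	return digest, citations, feeUSD, nil
}

func main() {
	digest, citations, fee, err := extract([]byte(sample))
	fmt.Println(digest, citations, fmt.Sprintf("%.3f", fee), err)
}
```

Because the fee is derived from the reported call count rather than from the request count, a single request that searches three times is priced at three calls, which matches the billing note in the comments above.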
// --- gemini_grounding (Gemini-3 native only) --------------------------------------
type geminiGrounding struct {
gem *geminiClient
st *Store
cfg *Config
}
func (p *geminiGrounding) Fetch(ctx context.Context, query string) (WebContext, error) {
// Durable, atomic daily cap FIRST: a grounded prompt is billed whether or not it
// grounds, and the per-prompt overage ($35/1k on 2.5) is the cost this guard exists
// to bound. Admit against the cap before spending. (grok_web_search needs no such
// cap — its $5/1k per-call fee is fully reserved per request and bounded by the
// per-user request cap + global ceiling.)
if ok, err := p.st.IncrGroundingIfUnder(p.cfg.WebGroundingDailyCap); err != nil {
return WebContext{}, err
} else if !ok {
return WebContext{}, errGroundingCapped
}
res, err := p.gem.groundedSearch(ctx, query) // errors (incl. no-citations) → caller degrades
cost := CostBreakdown{Grounding: computeUSD(p.cfg.GeminiModel, res.Usage, p.cfg)}
if err != nil {
return WebContext{Cost: cost, Usage: res.Usage}, err
}
return WebContext{Digest: res.Digest, Citations: res.Citations, Usage: res.Usage, Cost: cost}, nil
}


@ -1,171 +0,0 @@
package main
import (
"bytes"
"context"
"encoding/json"
"fmt"
"io"
"log/slog"
"math/rand"
"net/http"
"time"
)
// XAIClient talks the OpenAI-compatible Chat Completions endpoint at
// {base}/chat/completions with a Bearer key.
type XAIClient struct {
base string
key string
http *http.Client
maxTry int
log *slog.Logger
}
func NewXAIClient(base, key string, logger *slog.Logger) *XAIClient {
return &XAIClient{
base: base,
key: key,
http: &http.Client{},
maxTry: 3,
log: logger,
}
}
type xaiMessage struct {
Role string `json:"role"`
Content string `json:"content"`
}
type xaiRequest struct {
Model string `json:"model"`
Messages []xaiMessage `json:"messages"`
MaxTokens int `json:"max_tokens"`
Temperature float64 `json:"temperature"`
Stream bool `json:"stream"`
}
type xaiUsage struct {
PromptTokens int `json:"prompt_tokens"`
CompletionTokens int `json:"completion_tokens"`
PromptTokensDetails struct {
CachedTokens int `json:"cached_tokens"`
} `json:"prompt_tokens_details"`
}
type xaiResponse struct {
Choices []struct {
Message struct {
Content string `json:"content"`
} `json:"message"`
FinishReason string `json:"finish_reason"`
} `json:"choices"`
Usage xaiUsage `json:"usage"`
}
func (r *xaiResponse) Text() string {
if len(r.Choices) == 0 {
return ""
}
return r.Choices[0].Message.Content
}
// Complete calls Chat Completions with retry on transient failures (429 / 5xx /
// network timeout, exponential backoff + jitter). Non-retryable 4xx fail
// immediately. On exhaustion the caller refunds the reserved request and notifies
// the user, so a transient failure is never silently swallowed (F6).
func (x *XAIClient) Complete(ctx context.Context, model string, msgs []xaiMessage, maxTokens int, temp float64) (*xaiResponse, error) {
reqBody := xaiRequest{
Model: model,
Messages: msgs,
MaxTokens: maxTokens,
Temperature: temp,
Stream: false,
}
payload, err := json.Marshal(reqBody)
if err != nil {
return nil, err
}
var lastErr error
for attempt := 0; attempt < x.maxTry; attempt++ {
if attempt > 0 {
// 0.5s, 1s, 2s … capped at 8s, plus up to 250ms jitter.
backoff := time.Duration(500<<uint(attempt-1)) * time.Millisecond
if backoff > 8*time.Second {
backoff = 8 * time.Second
}
backoff += time.Duration(rand.Intn(250)) * time.Millisecond
select {
case <-ctx.Done():
return nil, ctx.Err()
case <-time.After(backoff):
}
}
resp, retryable, err := x.attempt(ctx, payload)
if err == nil {
return resp, nil
}
lastErr = err
if ctx.Err() != nil {
return nil, ctx.Err()
}
if !retryable {
return nil, err
}
if x.log != nil {
x.log.Warn("xai attempt failed, will retry", "attempt", attempt+1, "max", x.maxTry, "err", err)
}
}
return nil, fmt.Errorf("xai: exhausted %d attempts: %w", x.maxTry, lastErr)
}
// attempt performs one HTTP call. It returns retryable=true for 429/5xx and
// network errors, false for other non-2xx (terminal 4xx).
func (x *XAIClient) attempt(ctx context.Context, payload []byte) (*xaiResponse, bool, error) {
// Per-attempt deadline so a hung connection doesn't block the whole loop.
attemptCtx, cancel := context.WithTimeout(ctx, 60*time.Second)
defer cancel()
req, err := http.NewRequestWithContext(attemptCtx, http.MethodPost, x.base+"/chat/completions", bytes.NewReader(payload))
if err != nil {
return nil, false, err
}
req.Header.Set("Content-Type", "application/json")
req.Header.Set("Authorization", "Bearer "+x.key)
resp, err := x.http.Do(req)
if err != nil {
// Network error / timeout — retryable (unless the parent ctx is done).
return nil, ctx.Err() == nil, err
}
defer resp.Body.Close()
data, _ := io.ReadAll(resp.Body)
if resp.StatusCode == http.StatusTooManyRequests || resp.StatusCode >= 500 {
return nil, true, fmt.Errorf("xai http %d: %s", resp.StatusCode, snippet(data))
}
if resp.StatusCode < 200 || resp.StatusCode >= 300 {
return nil, false, fmt.Errorf("xai http %d: %s", resp.StatusCode, snippet(data))
}
var out xaiResponse
if err := json.Unmarshal(data, &out); err != nil {
return nil, false, fmt.Errorf("xai decode: %w", err)
}
// A 2xx is a billed call even when the model returns empty content (content
// filter, finish_reason=length with no text, or no choices). Return it as a
// success so the caller books the real cost via Reconcile instead of refunding
// the slot and losing the spend — which would let empty replies bypass BOTH the
// per-user cap and the global ceiling. The caller just won't send an empty body.
return &out, false, nil
}
func snippet(b []byte) string {
const max = 300
if len(b) > max {
return string(b[:max]) + "…"
}
return string(b)
}