luke_scribe/.omc/artifacts/ask/gemini-20260603-095739.md

Here is a prioritized, read-only review of the STT API design and documentation, analyzing both the Specification and the Consensus Plan.

### 1. REST + WebSocket API Design
The single `Job` abstraction and `options` payload are pragmatic, but the design has several structural and ergonomic gaps:

*   **REST Gaps:**
    *   **Webhooks vs. Polling:** For 4-hour video transcriptions, polling `GET /v1/jobs/{id}` is a severe anti-pattern. An optional `webhook_url` in the `options` schema is critical.
    *   **Idempotency & Pagination:** `POST /v1/jobs` for a 2GB file needs an `Idempotency-Key` header so that a retry after a transient network drop does not trigger duplicate processing. Furthermore, `GET /v1/jobs` lacks pagination (cursor or offset), so the response grows without bound as jobs accumulate under heavy usage.
    *   **Result TTL:** The 7-day retention policy is noted, but the API response behavior on day 8 is undefined (should explicitly be `410 Gone`, not `404 Not Found`).
*   **WebSocket (WS) Protocol:**
    *   **Auth Handshake:** The browser `WebSocket` API cannot set custom headers such as `X-API-Key` on the handshake request. The docs must specify passing the key via a query parameter (`?api_key=...`) or within the first JSON message payload.
    *   **Codec Negotiation & Backpressure:** The WS schema lacks an explicit audio format declaration (e.g., sample rate, PCM16 vs. Opus). Additionally, while `429` handles REST queue overflow, WS backpressure is undefined; the protocol should define an explicit signal, e.g., `{"type": "error", "reason": "buffer_full"}`.
    *   **Reconnection:** There is no mechanism for a client to resume a dropped WS session without losing the `LocalAgreement` context buffer. A `session_id` is required for mid-stream resumption.
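
The REST points above (idempotent creation, cursor pagination, and `410 Gone` after retention) can be illustrated with a minimal in-memory sketch. `JobStore`, `list_page`, `status_code`, and the choice to measure the 7-day TTL from job creation are assumptions made for illustration; neither the Spec nor the Plan defines them.

```python
import time
import uuid
from dataclasses import dataclass, field

RESULT_TTL_SECONDS = 7 * 24 * 3600  # 7-day retention; measured from creation here (assumption)


@dataclass
class Job:
    id: str
    created_at: float


@dataclass
class JobStore:
    """Hypothetical in-memory stand-in for the real job backend."""
    jobs: dict = field(default_factory=dict)         # job id -> Job, in insertion order
    by_idem_key: dict = field(default_factory=dict)  # Idempotency-Key -> job id

    def create(self, idempotency_key, now=None):
        """POST /v1/jobs: a retry carrying the same key returns the original job."""
        if idempotency_key in self.by_idem_key:
            return self.jobs[self.by_idem_key[idempotency_key]]
        job = Job(id=uuid.uuid4().hex, created_at=time.time() if now is None else now)
        self.jobs[job.id] = job
        self.by_idem_key[idempotency_key] = job.id
        return job

    def list_page(self, cursor=None, limit=50):
        """GET /v1/jobs: return (jobs, next_cursor); the cursor is the last id seen."""
        ids = list(self.jobs)
        start = 0 if cursor is None else ids.index(cursor) + 1
        page = [self.jobs[i] for i in ids[start:start + limit]]
        next_cursor = page[-1].id if start + limit < len(ids) else None
        return page, next_cursor

    def status_code(self, job_id, now=None):
        """GET /v1/jobs/{id}: 404 if never seen, 410 once the result has expired."""
        job = self.jobs.get(job_id)
        if job is None:
            return 404
        now = time.time() if now is None else now
        return 410 if now - job.created_at > RESULT_TTL_SECONDS else 200
```

Keeping the job record after its result expires is what lets the server distinguish `410 Gone` from `404 Not Found`.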

### 2. Documentation Clarity & Spec/Plan Contradictions
There is significant drift between the Spec and the Consensus Plan. An engineer implementing this will face contradictions that pose a deployment risk:

*   **VRAM Sizing Drift (Critical):** The Spec estimates `large-v3 fp16` at ~6GB VRAM. The Plan correctly overrides this to **10GB** to account for conservative headroom and sequence length. The Spec must be updated to avoid engineers undersizing GPU instances.
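*   **Queue Architecture:** The Spec loosely suggests "RQ/Celery". The Plan definitively locks in **RQ `SimpleWorker` (no-fork)** because the default RQ worker and Celery's default prefork pool rely on `os.fork()`, and a CUDA context initialized in the parent process cannot be used from a forked child. If an engineer follows the Spec and uses Celery with its defaults, workers are likely to crash as soon as a forked child touches the GPU.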
*   **Worker Model:** The Spec implies a standard web-worker pool. The Plan enforces a strict "load-once per worker process" architecture to avoid VRAM fragmentation. This constraint must be elevated in the Spec.
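
The "load-once per worker process" constraint can be sketched as a process-wide lazy singleton that refuses to be reused across a fork. This is only an illustration of the idea, not the Plan's implementation: `get_model` and the `loader` callable are hypothetical names, and the real loader would construct the `large-v3` model.

```python
import os
import threading

_model = None
_model_pid = None
_lock = threading.Lock()


def get_model(loader):
    """Return the process-wide model, calling `loader` at most once per process."""
    global _model, _model_pid
    with _lock:
        if _model is None:
            _model = loader()  # hypothetical: build the STT model here
            _model_pid = os.getpid()
        elif _model_pid != os.getpid():
            # The model (and any CUDA context behind it) was inherited via fork.
            raise RuntimeError("model was loaded in a parent process; use a no-fork worker")
        return _model
```

With a worker that runs jobs in its own process, every job after the first reuses the already-loaded model, so VRAM is allocated once rather than per request.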

### 3. Alternative Approaches
*   **Queue Backend (Redis vs. SQLite):** While Redis/RQ is durable, it bloats the Docker and Colab footprint. **Alternative:** Since this is a local-first API running on a single box, using `taskiq` or `huey` backed by SQLite/file-system eliminates the Redis container entirely while maintaining durability.
*   **Realtime Streaming:** Implementing custom `LocalAgreement-2` over `faster-whisper` (as planned) is notoriously brittle for edge cases (e.g., mid-word VAD slicing). **Alternative:** Adopt the C++ `whisper.cpp` streaming server natively via bindings, which handles VAD, context windowing, and memory stability much more efficiently than a custom Python implementation.
*   **Model Weights Distribution:** Baking weights into the Docker image inflates image size and pull time, and downloading them synchronously on boot risks startup timeouts. **Alternative:** Use an init-container or volume mount for weights.
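
To make the SQLite alternative concrete, here is a minimal sketch of a file-backed job queue using only the standard-library `sqlite3` module. It is not the `huey` or `taskiq` API; the table layout and the helper names (`open_queue`, `enqueue`, `claim_next`) are assumptions chosen to show that queued jobs can survive a process restart without a Redis container.

```python
import sqlite3


def open_queue(path):
    """Open (or create) the queue database at `path`."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'queued')"
    )
    conn.commit()
    return conn


def enqueue(conn, payload):
    """Persist a job and return its id; the row is on disk once this returns."""
    with conn:
        cur = conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def claim_next(conn):
    """Mark the oldest queued job as running and return (id, payload), or None."""
    with conn:
        # Take the write lock up front so two workers cannot claim the same row.
        conn.execute("BEGIN IMMEDIATE")
        row = conn.execute(
            "SELECT id, payload FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is None:
            return None
        conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
    return row
```

A production queue would also need retries and recovery of jobs left in `running` after a crash, which is the kind of work an existing library would take on.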

### 4. Edge-Case Usability
*   **First-Run Download Penalty:** A `large-v3` model takes minutes to download. A REST request hitting the API during a cold boot will trigger a timeout. The API needs a `status: "downloading_model"` state.
*   **Colab URL Rotation:** `cloudflared` Quick Tunnels are ephemeral: each tunnel restart produces a new URL. If a client is polling a 4-hour job and the tunnel drops, the client loses its only address for the job and the job is effectively orphaned. The CLI should enforce ngrok auth-tokens or webhooks for long-running batch jobs.
*   **Multi-language Auto-detect:** Passing `"auto"` language to the `turbo` model on a mixed KO/EN clip often results in the model locking onto English and hallucinating Korean phonetics. The `options` schema should support a prioritized language hint array, not just `"auto"`.
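
One way a prioritized language hint array could be applied is sketched below. The function name `pick_language`, the shape of `probs` (language code mapped to detection probability), and the `min_prob` threshold of 0.3 are all assumptions for illustration; the Spec does not define how hints would interact with auto-detection.

```python
def pick_language(probs, hints, min_prob=0.3):
    """Choose a language from prioritized hints using detection probabilities.

    probs: dict of language code -> detection probability for the clip
    hints: language codes in priority order, e.g. ["ko", "en"]
    """
    if not hints:
        # No hints: behave like plain "auto" and take the most probable language.
        return max(probs, key=probs.get)
    for lang in hints:
        # The first hint that is plausible enough wins, even if another
        # language scored higher overall.
        if probs.get(lang, 0.0) >= min_prob:
            return lang
    # No hint is confident enough: fall back to the most probable hinted language.
    return max(hints, key=lambda lang: probs.get(lang, 0.0))
```

On a mixed KO/EN clip where detection slightly favors English, hints of `["ko", "en"]` would keep the decoder on Korean as long as Korean clears the threshold.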