luke_scribe/.omc/artifacts/ask/gemini-20260603-095739.md
lukehemmin fbe13dddcc chore: initial commit — planning docs and omc project context
Greenfield setup for luke_scribe (local STT transcription API). No source
code yet; this captures the completed design phase so teammates can ramp
through oh-my-claudecode.

Includes:
- .omc/plans/consensus-luke-scribe-stt-api.md — consensus impl plan v2.2
- .omc/specs/deep-interview-luke-scribe-stt-api.md — deep-interview spec
- .omc/artifacts/ask/{codex,gemini}-*.md — external review (CCG)
- .omc/project-memory.json — omc project memory
- opencode.json, .claude/settings.json — shared tooling config
- .gitignore — excludes ephemeral omc state/session logs and local settings

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 10:08:17 +09:00

4.4 KiB

Here is a prioritized, read-only review of the STT API design and documentation, analyzing both the Specification and the Consensus Plan.

1. REST + WebSocket API Design

The single Job abstraction and options payload are pragmatic, but the design has several structural and ergonomic gaps:

  • REST Gaps:
    • Webhooks vs. Polling: For 4-hour video transcriptions, polling GET /v1/jobs/{id} is a severe anti-pattern. An optional webhook_url in the options schema is critical.
    • Idempotency & Pagination: POST /v1/jobs for a 2GB file needs an Idempotency-Key header so that a retry after a transient network drop does not start a duplicate job. GET /v1/jobs also lacks pagination (cursor or offset), so its response grows without bound as jobs accumulate.
    • Result TTL: The 7-day retention policy is noted, but the API response on day 8 is undefined. It should explicitly be 410 Gone, not 404 Not Found.
  • WebSocket (WS) Protocol:
    • Auth Handshake: Browsers cannot send custom headers (X-API-Key) during a WS handshake. The docs must specify passing the key via query parameters (?api_key=...) or within the first JSON message payload.
    • Codec Negotiation & Backpressure: The WS schema lacks an explicit audio format declaration (e.g., sample rate, PCM16 vs. Opus). Additionally, while 429 handles REST queue overflow, WS backpressure is undefined; the protocol should define a signal for it (e.g., {"type": "error", "reason": "buffer_full"}).
    • Reconnection: There is no mechanism for a client to resume a dropped WS session without losing the LocalAgreement context buffer. A session_id is required for mid-stream resumption.
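To make the idempotency and result-TTL points concrete, here is a minimal in-memory sketch. JobStore, its method names, and the tombstone set used to tell "expired" apart from "never existed" are all hypothetical and are not taken from the Spec or the Plan; a real service would keep this state in its job database.

```python
import time
from dataclasses import dataclass, field

RETENTION_SECONDS = 7 * 24 * 3600  # the 7-day retention window noted in the docs


@dataclass
class JobStore:
    """In-memory stand-in for the job table (hypothetical, for illustration)."""
    jobs: dict = field(default_factory=dict)         # job_id -> created_at
    by_idem_key: dict = field(default_factory=dict)  # Idempotency-Key -> job_id
    expired: set = field(default_factory=set)        # tombstones of purged jobs

    def create(self, idem_key, now=None):
        """Return (job_id, created). A repeated key maps to the same job."""
        now = time.time() if now is None else now
        if idem_key in self.by_idem_key:
            return self.by_idem_key[idem_key], False
        job_id = f"job-{len(self.by_idem_key) + 1}"
        self.jobs[job_id] = now
        self.by_idem_key[idem_key] = job_id
        return job_id, True

    def status_code(self, job_id, now=None):
        """HTTP status for GET /v1/jobs/{id}: 200, 410 once expired, else 404."""
        now = time.time() if now is None else now
        if job_id in self.expired:
            return 410
        created = self.jobs.get(job_id)
        if created is None:
            return 404
        if now - created > RETENTION_SECONDS:
            # Purge the result but remember the id so later reads stay 410.
            del self.jobs[job_id]
            self.expired.add(job_id)
            return 410
        return 200
```

A retried POST carrying the same Idempotency-Key gets the original job id back instead of a second 2GB upload being queued, and a client that asks for a purged job can distinguish "gone" from "wrong id".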

2. Documentation Clarity & Spec/Plan Contradictions

There is significant drift between the Spec and the Consensus Plan. An engineer implementing this will face contradictions that pose a deployment risk:

  • VRAM Sizing Drift (Critical): The Spec estimates large-v3 fp16 at ~6GB VRAM. The Plan correctly overrides this to 10GB to account for conservative headroom and sequence length. The Spec must be updated so that engineers do not undersize GPU instances.
  • Queue Architecture: The Spec loosely suggests "RQ/Celery". The Plan definitively locks in RQ SimpleWorker (no-fork) because the os.fork() behavior of standard RQ and Celery workers crashes PyTorch CUDA contexts. If an engineer follows the Spec and uses Celery, workers will fail as soon as a forked child touches CUDA.
  • Worker Model: The Spec implies a standard web-worker pool. The Plan enforces a strict "load-once per worker process" architecture to avoid VRAM fragmentation. This constraint must be elevated in the Spec.
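The "load-once per worker process" constraint can be illustrated with a small sketch. It assumes a process-wide cache is an acceptable way to express the rule; get_model and handle_job are hypothetical names, and the dictionary stands in for the real model handle, which is not shown here.

```python
import functools
import os


@functools.lru_cache(maxsize=None)
def get_model(name):
    """Load a model at most once per process and reuse it for every later job."""
    # Placeholder for the real, expensive load (assumption: the actual
    # worker would construct its speech model here). A dict stands in.
    return {"name": name, "pid": os.getpid()}


def handle_job(audio_path, model_name="large-v3"):
    """Hypothetical job handler: fetch the cached model, never reload it."""
    model = get_model(model_name)
    return {"audio": audio_path, "model": model["name"]}
```

Because the cache lives in the process, this only holds when jobs run in the worker process itself rather than in a forked child, which is consistent with the no-fork worker the Plan selects.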

3. Alternative Approaches

  • Queue Backend (Redis vs. SQLite): While Redis/RQ is durable, it bloats the Docker and Colab footprint. Alternative: Since this is a local-first API running on a single box, using taskiq or huey backed by SQLite/file-system eliminates the Redis container entirely while maintaining durability.
  • Realtime Streaming: Implementing custom LocalAgreement-2 over faster-whisper (as planned) is notoriously brittle for edge cases (e.g., mid-word VAD slicing). Alternative: Adopt the C++ whisper.cpp streaming server natively via bindings, which handles VAD, context windowing, and memory stability much more efficiently than a custom Python implementation.
  • Model Weights Distribution: Baking weights into Docker or downloading them synchronously on boot will cause timeouts. Alternative: Use an init-container or volume mount for weights.
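To show roughly what a Redis-free, SQLite-backed queue involves, the sketch below hand-rolls one with the standard library. It is not the taskiq or huey API; the table layout and the function names open_queue, enqueue, and claim_next are assumptions made for illustration.

```python
import json
import sqlite3


def open_queue(path=":memory:"):
    """Open (or create) the queue database; use a file path for durability."""
    # isolation_level=None puts the connection in autocommit mode so the
    # transaction in claim_next can be controlled explicitly.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS jobs ("
        " id INTEGER PRIMARY KEY AUTOINCREMENT,"
        " payload TEXT NOT NULL,"
        " status TEXT NOT NULL DEFAULT 'queued')"
    )
    return conn


def enqueue(conn, payload):
    """Store a JSON-serializable payload and return the new job id."""
    cur = conn.execute(
        "INSERT INTO jobs (payload) VALUES (?)", (json.dumps(payload),)
    )
    return cur.lastrowid


def claim_next(conn):
    """Atomically claim the oldest queued job, or return None if empty."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, payload FROM jobs"
            " WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute(
                "UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],)
            )
        conn.execute("COMMIT")
    except Exception:
        conn.execute("ROLLBACK")
        raise
    return None if row is None else (row[0], json.loads(row[1]))
```

With a file path instead of ":memory:", queued jobs survive a restart on a single box, which is the durability property the alternative is meant to keep.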

4. Edge-Case Usability

  • First-Run Download Penalty: A large-v3 model takes minutes to download. A REST request hitting the API during a cold boot will trigger a timeout. The API needs a status: "downloading_model" state.
  • Colab URL Rotation: cloudflared Quick Tunnels are ephemeral and rotate frequently. If a client is polling a 4-hour job and the tunnel drops, the job is orphaned. The CLI should enforce ngrok auth-tokens or webhooks for long-running batch jobs.
  • Multi-language Auto-detect: Passing "auto" language to the turbo model on a mixed KO/EN clip often results in the model locking onto English and hallucinating Korean phonetics. The options schema should support a prioritized language hint array, not just "auto".
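One way the prioritized language hint array could work is sketched below. The function names and the rule "pick the most probable detected language among the hints, earlier hints win ties" are assumptions for illustration; the docs do not define this behavior, and the probabilities are taken as given from whatever language detection the model provides.

```python
def normalize_language(value):
    """Turn "auto", a single code, or a list of codes into a hint list."""
    if value is None or value == "auto":
        return []  # an empty list means unrestricted auto-detect
    if isinstance(value, str):
        value = [value]
    hints = []
    for code in value:
        code = code.strip().lower()
        if not code or code == "auto":
            raise ValueError("hint lists must contain concrete language codes")
        if code not in hints:  # drop duplicates, keep priority order
            hints.append(code)
    return hints


def pick_language(probs, hints):
    """Choose a language from detection probabilities, restricted to hints."""
    if not hints:
        return max(probs, key=probs.get) if probs else None
    # Highest probability among hinted languages; earlier hints win ties.
    return max(hints, key=lambda c: (probs.get(c, 0.0), -hints.index(c)))
```

For a mixed KO/EN clip, a request with hints ["ko", "en"] would never resolve to a third language, and falls back to the first hint when detection gives no support to either.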