fbe13dddcc
Greenfield setup for luke_scribe (local STT transcription API). No source
code yet; this captures the completed design phase so teammates can ramp
through oh-my-claudecode.
Includes:
- .omc/plans/consensus-luke-scribe-stt-api.md — consensus impl plan v2.2
- .omc/specs/deep-interview-luke-scribe-stt-api.md — deep-interview spec
- .omc/artifacts/ask/{codex,gemini}-*.md — external review (CCG)
- .omc/project-memory.json — omc project memory
- opencode.json, .claude/settings.json — shared tooling config
- .gitignore — excludes ephemeral omc state/session logs and local settings
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
4.4 KiB
Here is a prioritized, read-only review of the STT API design and documentation, analyzing both the Specification and the Consensus Plan.
1. REST + WebSocket API Design
The single Job abstraction and options payload are pragmatic, but the design has several structural and ergonomic gaps:
- REST Gaps:
  - Webhooks vs. Polling: For 4-hour video transcriptions, polling `GET /v1/jobs/{id}` is a severe anti-pattern. An optional `webhook_url` in the `options` schema is critical.
  - Idempotency & Pagination: `POST /v1/jobs` for a 2GB file needs an `Idempotency-Key` header to prevent duplicate processing on transient network drops. Furthermore, `GET /v1/jobs` lacks pagination (cursors/offset), which will break the API after days of heavy usage.
  - Result TTL: The 7-day retention policy is noted, but the API response behavior on day 8 is undefined (should explicitly be `410 Gone`, not `404 Not Found`).
- WebSocket (WS) Protocol:
  - Auth Handshake: Browsers cannot send custom headers (`X-API-Key`) during a WS handshake. The docs must specify passing the key via query parameters (`?api_key=...`) or within the first JSON message payload.
  - Codec Negotiation & Backpressure: The WS schema lacks an explicit audio format declaration (e.g., sample rate, PCM16 vs. Opus). Additionally, while `429` handles REST queue overflow, WS backpressure is undefined (e.g., `{"type": "error", "reason": "buffer_full"}`).
  - Reconnection: There is no mechanism for a client to resume a dropped WS session without losing the `LocalAgreement` context buffer. A `session_id` is required for mid-stream resumption.
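The REST gaps listed above (idempotency, pagination, and the day-8 response) can be illustrated with a minimal in-memory sketch. Everything here is an assumption for illustration: `JobStore`, `create`, `get_status_code`, and `list_page` are hypothetical names, the cursor encoding is arbitrary, and a real service would back this with the job database rather than a dict.

```python
import base64
import time
import uuid
from dataclasses import dataclass, field

RETENTION_SECONDS = 7 * 24 * 3600  # 7-day result retention, as stated in the review


@dataclass
class Job:
    id: str
    created_at: float
    status: str = "queued"


@dataclass
class JobStore:
    """Hypothetical in-memory stand-in for the real job table."""
    jobs: dict = field(default_factory=dict)    # job_id -> Job (insertion-ordered)
    by_key: dict = field(default_factory=dict)  # Idempotency-Key -> job_id

    def create(self, idempotency_key, now=None):
        """POST /v1/jobs: a repeated key returns the original job, not a new one."""
        if idempotency_key in self.by_key:
            return self.jobs[self.by_key[idempotency_key]], False
        created = now if now is not None else time.time()
        job = Job(id=uuid.uuid4().hex, created_at=created)
        self.jobs[job.id] = job
        self.by_key[idempotency_key] = job.id
        return job, True

    def get_status_code(self, job_id, now=None):
        """GET /v1/jobs/{id}: 404 if never seen, 410 past retention, else 200."""
        job = self.jobs.get(job_id)
        if job is None:
            return 404
        now = now if now is not None else time.time()
        return 410 if now - job.created_at > RETENTION_SECONDS else 200

    def list_page(self, limit=2, cursor=None):
        """GET /v1/jobs: returns (jobs, next_cursor); the cursor is an opaque offset."""
        start = int(base64.urlsafe_b64decode(cursor).decode()) if cursor else 0
        ids = list(self.jobs)
        page = [self.jobs[i] for i in ids[start:start + limit]]
        end = start + len(page)
        more = end < len(ids)
        next_cursor = base64.urlsafe_b64encode(str(end).encode()).decode() if more else None
        return page, next_cursor
```

An offset hidden inside an opaque cursor keeps the sketch short; a keyset cursor (for example, the last seen job ID) would stay stable if jobs are deleted between page requests.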
2. Documentation Clarity & Spec/Plan Contradictions
There is significant drift between the Spec and the Consensus Plan. An engineer implementing this will face contradictions that pose a deployment risk:
- VRAM Sizing Drift (Critical): The Spec estimates `large-v3 fp16` at ~6GB VRAM. The Plan correctly overrides this to 10GB to account for conservative headroom and sequence length. The Spec must be updated to avoid engineers undersizing GPU instances.
- Queue Architecture: The Spec loosely suggests "RQ/Celery". The Plan definitively locks in RQ `SimpleWorker` (no-fork) because standard RQ/Celery `os.fork()` behaviors crash PyTorch CUDA contexts. If an engineer follows the Spec and uses Celery, the application will crash on boot.
- Worker Model: The Spec implies a standard web-worker pool. The Plan enforces a strict "load-once per worker process" architecture to avoid VRAM fragmentation. This constraint must be elevated in the Spec.
3. Alternative Approaches
- Queue Backend (Redis vs. SQLite): While Redis/RQ is durable, it bloats the Docker and Colab footprint. Alternative: Since this is a local-first API running on a single box, using `taskiq` or `huey` backed by SQLite/file-system eliminates the Redis container entirely while maintaining durability.
- Realtime Streaming: Implementing custom `LocalAgreement-2` over `faster-whisper` (as planned) is notoriously brittle for edge cases (e.g., mid-word VAD slicing). Alternative: Adopt the C++ `whisper.cpp` streaming server natively via bindings, which handles VAD, context windowing, and memory stability much more efficiently than a custom Python implementation.
- Model Weights Distribution: Baking weights into Docker or downloading them synchronously on boot will cause timeouts. Alternative: Use an init-container or volume mount for weights.
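To make the SQLite queue alternative concrete, here is a hand-rolled sketch using only the standard-library `sqlite3` module. It is not `huey` or `taskiq` and does not reflect their APIs; the table layout and the `open_queue`, `enqueue`, and `claim_next` helpers are assumptions chosen to show that a durable, single-box queue does not need a Redis container.

```python
import sqlite3

SCHEMA = """
CREATE TABLE IF NOT EXISTS jobs (
    id INTEGER PRIMARY KEY AUTOINCREMENT,
    payload TEXT NOT NULL,
    status TEXT NOT NULL DEFAULT 'queued'
)
"""


def open_queue(path):
    # isolation_level=None puts sqlite3 in autocommit mode, so transactions
    # are controlled explicitly with BEGIN/COMMIT below.
    conn = sqlite3.connect(path, isolation_level=None)
    conn.execute(SCHEMA)
    return conn


def enqueue(conn, payload):
    """Persist a job and return its row id."""
    cur = conn.execute("INSERT INTO jobs (payload) VALUES (?)", (payload,))
    return cur.lastrowid


def claim_next(conn):
    """Atomically mark the oldest queued job as running and return (id, payload)."""
    conn.execute("BEGIN IMMEDIATE")  # take the write lock before reading
    try:
        row = conn.execute(
            "SELECT id, payload FROM jobs WHERE status = 'queued' ORDER BY id LIMIT 1"
        ).fetchone()
        if row is not None:
            conn.execute("UPDATE jobs SET status = 'running' WHERE id = ?", (row[0],))
        conn.execute("COMMIT")
        return row
    except Exception:
        conn.execute("ROLLBACK")
        raise
```

Pointing `open_queue` at a file path rather than `":memory:"` is what gives durability across restarts; a job left in `running` after a crash would still need a recovery policy, which this sketch omits.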
4. Edge-Case Usability
- First-Run Download Penalty: A `large-v3` model takes minutes to download. A REST request hitting the API during a cold boot will trigger a timeout. The API needs a `status: "downloading_model"` state.
- Colab URL Rotation: `cloudflared` Quick Tunnels are ephemeral and rotate frequently. If a client is polling a 4-hour job and the tunnel drops, the job is orphaned. The CLI should enforce ngrok auth-tokens or webhooks for long-running batch jobs.
- Multi-language Auto-detect: Passing `"auto"` language to the `turbo` model on a mixed KO/EN clip often results in the model locking onto English and hallucinating Korean phonetics. The `options` schema should support a prioritized language hint array, not just `"auto"`.