chore: initial commit — planning docs and omc project context

Greenfield setup for luke_scribe (local STT transcription API). No source code yet; this captures the completed design phase so teammates can ramp through oh-my-claudecode.

Includes:
- .omc/plans/consensus-luke-scribe-stt-api.md: consensus impl plan v2.2
- .omc/specs/deep-interview-luke-scribe-stt-api.md: deep-interview spec
- .omc/artifacts/ask/{codex,gemini}-*.md: external review (CCG)
- .omc/project-memory.json: omc project memory
- opencode.json, .claude/settings.json: shared tooling config
- .gitignore: excludes ephemeral omc state/session logs and local settings

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
2026-06-07 10:08:17 +09:00
commit fbe13dddcc
8 changed files with 3357 additions and 0 deletions
@@ -0,0 +1,5 @@
+{
+  "enabledPlugins": {
+    "oh-my-claudecode@omc": true
+  }
+}
@@ -0,0 +1,40 @@
+# ─── Python ───────────────────────────────────────────────
+__pycache__/
+*.py[cod]
+*.egg-info/
+.eggs/
+build/
+dist/
+.pytest_cache/
+.ruff_cache/
+.mypy_cache/
+
+# Virtual envs (keep uv.lock tracked)
+.venv/
+venv/
+
+# Env / secrets (keep .env.example tracked)
+.env
+.env.*
+!.env.example
+
+# Models / data / scratch
+*.log
+models/
+samples/*.wav
+samples/*.mp4
+
+# ─── OS / editor ──────────────────────────────────────────
+.DS_Store
+*.swp
+
+# ─── Claude Code (personal/local only) ────────────────────
+.claude/settings.local.json
+
+# ─── oh-my-claudecode ─────────────────────────────────────
+# Keep shareable knowledge so teammates can ramp via omc:
+#   .omc/plans/  .omc/specs/  .omc/artifacts/  .omc/project-memory.json
+# Ignore ephemeral, machine-/session-specific runtime state:
+.omc/state/
+.omc/sessions/
+.omc/logs/
@@ -0,0 +1,32 @@
+Here is a prioritized, read-only review of the STT API design and documentation, analyzing both the Specification and the Consensus Plan.
+
+### 1. REST + WebSocket API Design
+The single `Job` abstraction and `options` payload are pragmatic, but the design has several structural and ergonomic gaps:
+
+*   **REST Gaps:**
+    *   **Webhooks vs. Polling:** For 4-hour video transcriptions, polling `GET /v1/jobs/{id}` is a severe anti-pattern. An optional `webhook_url` in the `options` schema is critical.
+    *   **Idempotency & Pagination:** `POST /v1/jobs` for a 2GB file needs an `Idempotency-Key` header to prevent duplicate processing on transient network drops. Furthermore, `GET /v1/jobs` lacks pagination (cursors/offset), which will break the API after days of heavy usage.
+    *   **Result TTL:** The 7-day retention policy is noted, but the API response behavior on day 8 is undefined (should explicitly be `410 Gone`, not `404 Not Found`).
+*   **WebSocket (WS) Protocol:**
+    *   **Auth Handshake:** Browsers cannot send custom headers (`X-API-Key`) during a WS handshake. The docs must specify passing the key via query parameters (`?api_key=...`) or within the first JSON message payload.
+    *   **Codec Negotiation & Backpressure:** The WS schema lacks an explicit audio format declaration (e.g., sample rate, PCM16 vs. Opus). Additionally, while `429` handles REST queue overflow, WS backpressure is undefined (e.g., `{"type": "error", "reason": "buffer_full"}`).
+    *   **Reconnection:** There is no mechanism for a client to resume a dropped WS session without losing the `LocalAgreement` context buffer. A `session_id` is required for mid-stream resumption.
+
+### 2. Documentation Clarity & Spec/Plan Contradictions
+There is significant drift between the Spec and the Consensus Plan. An engineer implementing this will face contradictions that pose a deployment risk:
+
+*   **VRAM Sizing Drift (Critical):** The Spec estimates `large-v3 fp16` at ~6GB VRAM. The Plan correctly overrides this to **10GB** to account for conservative headroom and sequence length. The Spec must be updated to avoid engineers undersizing GPU instances.
+*   **Queue Architecture:** The Spec loosely suggests "RQ/Celery". The Plan definitively locks in **RQ `SimpleWorker` (no-fork)** because standard RQ/Celery `os.fork()` behaviors crash PyTorch CUDA contexts. If an engineer follows the Spec and uses Celery, the application will crash on boot.
+*   **Worker Model:** The Spec implies a standard web-worker pool. The Plan enforces a strict "load-once per worker process" architecture to avoid VRAM fragmentation. This constraint must be elevated in the Spec.
+
+### 3. Alternative Approaches
+*   **Queue Backend (Redis vs. SQLite):** While Redis/RQ is durable, it bloats the Docker and Colab footprint. **Alternative:** Since this is a local-first API running on a single box, using `taskiq` or `huey` backed by SQLite/file-system eliminates the Redis container entirely while maintaining durability.
+*   **Realtime Streaming:** Implementing custom `LocalAgreement-2` over `faster-whisper` (as planned) is notoriously brittle for edge cases (e.g., mid-word VAD slicing). **Alternative:** Adopt the C++ `whisper.cpp` streaming server natively via bindings, which handles VAD, context windowing, and memory stability much more efficiently than a custom Python implementation.
+*   **Model Weights Distribution:** Baking weights into Docker or downloading them synchronously on boot will cause timeouts. **Alternative:** Use an init-container or volume mount for weights.
+
+### 4. Edge-Case Usability
+*   **First-Run Download Penalty:** A `large-v3` model takes minutes to download. A REST request hitting the API during a cold boot will trigger a timeout. The API needs a `status: "downloading_model"` state.
+*   **Colab URL Rotation:** `cloudflared` Quick Tunnels are ephemeral and rotate frequently. If a client is polling a 4-hour job and the tunnel drops, the job is orphaned. The CLI should enforce ngrok auth-tokens or webhooks for long-running batch jobs.
+*   **Multi-language Auto-detect:** Passing `"auto"` language to the `turbo` model on a mixed KO/EN clip often results in the model locking onto English and hallucinating Korean phonetics. The `options` schema should support a prioritized language hint array, not just `"auto"`.
@@ -0,0 +1,266 @@
+# Consensus Implementation Plan: luke_scribe, a Local STT Transcription API
+
+- **Status:** `pending approval` (consensus **v2.2**: the v2.1 consensus plus the CCG external review (Codex/Gemini); see §3.6 capability tiers and §3.10 provisioning/WS/shared store/Colab)
+- **Mode:** `--consensus --direct --deliberate`
+- **Source spec:** `.omc/specs/deep-interview-luke-scribe-stt-api.md` (ambiguity ~10%, PASSED)
+- **Project:** greenfield `/root/luke_scribe`
+- **Generated:** 2026-06-02 · **Revised:** 2026-06-02 (v2)
+
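+> **The v2 changelog is in §12 near the end of this document.** Relative to v1, it addresses 3 P0 items (worker execution model, VRAM measurement and phase-order inversion, cancellation/temp files/retention race), 4 P1 items, 3 P2 items, and 4 ambiguous ACs.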
+
+---
+
+## 1. Requirements Summary
+
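+Internal (non-public) local STT transcription API. A single `Job` abstraction handles both **batch (file/video)** and **realtime (WebSocket)** input. The runtime is faster-whisper (CTranslate2) with a **hybrid model** setup (realtime=turbo, batch=large-v3), **subject to validation by the P1 bench gate**. The **Device Manager** detects GPU/CPU and derives precision and worker count from **VRAM measured at boot** (GTX 1050 through H100); `auto|cpu|cuda` can be forced. A **durable Redis (RQ) queue** serves batch and a **dedicated realtime handler** serves WS, both reporting `queue_position` and `progress`. Output options are per request (txt/ts/word/diarize/SRT/VTT). Post-processing runs glossary→rules→LLM (local/external, off by default, confidence-gated)→confidence. API key auth (with scopes); results retained for 7 days; **source and derived audio deleted immediately**; inputs over 4h/2GB → `413`; full queue → `429`. Deployment: CLI (dev/Colab) + Docker (prod) + Colab cloudflared. Diarization is optional (pyannote).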
+
+---
+
+## 2. RALPLAN-DR Summary
+
+### Principles
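+1. **Hardware-adaptive, fail-explicit**: auto-detect anything from a GTX 1050 to an H100 and derive precision/concurrency. If a model does not fit, degrade gracefully (**model/precision downgrade → CPU**); if no downgrade is possible, **reject with an explicit error** (no silent OOM, no unbounded downgrade loop). The rule is "never fail silently", not "never fail".
+2. **One Job abstraction, two execution lanes**: every input follows the same Job lifecycle (queued→processing→completed/failed/cancelled), but execution is split into **batch = RQ worker / realtime = long-lived WS handler** (a WS session is not an enqueue-once task).
+3. **Accuracy/latency split (validated)**: batch=large-v3, realtime=turbo. **Adopting the hybrid is gated on the delta measured by the P1 bench** (if the delta is insufficient, simplify to a single model).
+4. **Privacy-first, enforced**: source and derived audio are deleted immediately (in `finally` on every exit path), results have a 7d TTL, and external LLM egress is forbidden without **allowlist + opt-in + audit log**.
+5. **Dev/prod parity**: same core; CLI (dev/Colab) and Docker (prod) differ only in configuration. The queue is RQ in prod and an in-proc fallback in dev, both **behind the same Job interface** so the semantics stay equivalent.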
+
+### Decision Drivers (top 3)
+1. **Mixed-language (KO+EN) accuracy**: hotwords + model choice + post-processing.
+2. **Hardware portability + auto-scaling (measurement-based)**.
+3. **Concurrency + visibility (queue_position/progress)**.
+
+### Viable Options
+
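+**D1. Queue/concurrency backend**
+- **(A) Redis + RQ `SimpleWorker` (no-fork) + long-lived model-holding process** ✅ *(chosen; the user confirmed Redis and this avoids the fork problem)*. Pros: durable and restart-tolerant, model loaded once and reused, avoids the CUDA-fork conflict. Cons: SimpleWorker has no heartbeat while a job runs → must be compensated with progress emits.
+- **(B) Redis + Celery**. Pros: mature routing/priorities/retries. Cons: overkill for a single GPU box. *Invalidation:* RQ + SimpleWorker is sufficient.
+- **(C) in-process asyncio + GPU semaphore**: dev/one-off fallback + **P2 contingency** (if RQ/CUDA is blocked, fall back behind the same Job interface).
+> ⚠️ **Rationale:** The default RQ worker calls `os.fork()` per job, and a CUDA context initialized in the parent cannot be reused in the forked child ([pytorch#40403](https://github.com/pytorch/pytorch/issues/40403)). **No fork (SimpleWorker/long-lived process)** is therefore mandatory. RQ has no built-in progress reporting, so `job.save_meta()` must be called manually ([RQ docs](https://python-rq.org/docs/jobs/)).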
+
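+**D2. Realtime streaming implementation**
+- **(A) Custom LocalAgreement-2 on top of faster-whisper** ✅ *(chosen)*. Pros: integrated control over queue/device/post-processing. Cons: stabilization correctness must be implemented in-house (do not underestimate the difficulty → contract spelled out in §3.7c). The lenient 3-5 s latency target relaxes only *latency*, not *correctness*.
+- **(B) Vendoring the WhisperLive backend**: wrap its proven WS+VAD as a backend. *Invalidation:* integration and hybrid control take priority, but **its chunking/stabilization heuristics are borrowed**.
+- **(C) WhisperLiveKit (AlignAtt/SimulStreaming)**: 2025 SOTA. *Invalidation:* over-investment for a 3-5 s target (P5 option).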
+
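+**D3. Inference backend abstraction**
+- **(A) Single faster-whisper engine + automatic compute_type** ✅. Pros: one path for GPU/CPU/int8/fp16. Cons: no other engines supported (not needed in this scope).
+- **(B) Multi-engine plugins**. *Invalidation:* premature abstraction → keep only a thin interface (`engine/base.py`) with a single implementation.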
+
+---
+
+## 3. Target Project Structure
+
+```
+luke_scribe/
+├── pyproject.toml       # uv; extras: gpu, diarize, llm
+├── run.sh
+├── .env.example
+├── docker/{Dockerfile.gpu, Dockerfile.cpu, docker-compose.yml}
+└── src/luke_scribe/
+    ├── config.py        # pydantic-settings (model rt/batch, device, precision, redis, retention, api_keys+scopes, tunnel, corrector+allowlist)
+    ├── cli.py           # typer: serve | transcribe | bench | detect
+    ├── api/{app.py, deps.py, schemas.py, routes/{jobs.py, stream.py, admin.py}}
+    ├── devices/{manager.py, profile.py, vram_probe.py}
+    ├── engine/{base.py, faster_whisper_engine.py, model_registry.py}
+    ├── audio/{ingest.py, vad.py}
+    ├── pipeline/{batch.py, realtime.py}
+    ├── jobqueue/{broker.py, jobs.py, worker.py, inproc.py, cancel.py}
+    ├── postprocess/{pipeline.py, glossary.py, rules.py, llm.py, confidence.py}
+    ├── diarization/pyannote_diarizer.py
+    ├── results/{store.py, formats.py, retention.py}
+    ├── connectivity/tunnel.py
+    └── observability/{logging.py, metrics.py}
+```
+
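+### 3.5 Worker Execution Model & GPU Concurrency  *(resolves P0-1)*
+- **Batch lane:** RQ **`SimpleWorker`** (or a long-lived custom worker). At worker boot, `WhisperModel` is **loaded once** and held for the lifetime of the process (no reload, no fork). **One worker process per GPU by default**; GPU access inside a worker is single-threaded (no concurrent decodes). Concurrency = **number of device-bound worker processes** (inter-process), overridable with `--workers`.
+- **Realtime lane:** A WS session is not enqueue-once, so it is not put on RQ; a **long-lived turbo handler inside the API process** (asyncio + a single GPU lock) handles it. The session→Job mapping exists for status tracking only.
+- **Heartbeat gap mitigation:** SimpleWorker has no heartbeat while a job runs, so the throttled progress emit of §3.7d effectively serves as the heartbeat (so that long jobs are not mistaken for "stalled").
+- **Contingency/Colab:** If GPU/RQ is blocked, or on **Colab (no Docker)**, use the D1-C **in-proc queue (no Redis needed)** behind the same Job interface (§3.10d).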
+
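+### 3.6 VRAM Sizing & Capability Tier  *(resolves P0-2 + CCG ③: automatic capability tiering)*
+- **Boot-time measurement (`devices/vram_probe.py`):** detect GPU VRAM (total/free), **RAM**, and **free disk**, then load the target model once and measure `allocated` (headroom ×1.3). **No reliance on static constants.**
+- **Conservative default constants (fallback before measurement):** large-v3 fp16 ≈10GB, int8 ≈3.5GB; turbo fp16 ≈4GB, int8 ≈1.8GB.
+- **Capability tier (automatic; instead of silently downgrading the model, the tier decides which models are offered):**
+
+  | Tier | Condition (measured) | Offered |
+  |---|---|---|
+  | T0 CPU | No GPU, or even turbo does not fit on the GPU | turbo@CPU |
+  | T1 turbo-GPU | turbo fits on the GPU, large-v3 does not | turbo@GPU (large-v3 not offered → batch also uses turbo) |
+  | T2 swap | large-v3 fits but cannot co-reside with turbo | per-call load/unload (MRU stays resident, swaps minimized) |
+  | T3 co-resident | turbo + large-v3 can be loaded together | both resident → rt (turbo) + batch (large-v3) concurrently |
+  | ~~T4~~ | multiple model replicas | excluded |
+
+- **Precision:** cc≥7.0 & free≥12GB → `float16`; cc≥7.0 & free<12GB → `int8_float16`; Pascal (6.x) → `int8`; CPU → `int8`.
+- **Worker count:** `workers = max(1, floor((free_VRAM − reserve) / measured_per_worker))` (`reserve` as defined in §3.9b = headroom + realtime model footprint).
+- **Disk guard:** check free space before downloading; fail with a clear error if it is insufficient.
+- **Transparency:** `/v1/system` and `/v1/models` expose the tier and the offered models; results always include **`model_used`/`compute_type_used`**.
+- **OOM handling:** at most 2 downgrade attempts (fp16→int8→CPU), then fail (requeue once). **GTX 1050**: tier T1 (turbo-int8) or T0. **Tesla T4 through H100 GPUs**: tier T3.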
+
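The sizing rules of §3.6 can be sketched as three pure functions. This is a minimal illustration, not the planned `devices/` implementation: the function names are hypothetical, footprints are assumed to already include the ×1.3 headroom, and `None` stands for "no usable GPU".

```python
import math
from typing import Optional


def choose_compute_type(compute_capability: Optional[float], free_vram_gb: float) -> str:
    """Map the §3.6 precision rule to a CTranslate2 compute_type string."""
    if compute_capability is None:       # no usable GPU: CPU path
        return "int8"
    if compute_capability >= 7.0:
        return "float16" if free_vram_gb >= 12 else "int8_float16"
    return "int8"                        # Pascal (6.x)


def worker_count(free_vram_gb: float, reserve_gb: float, per_worker_gb: float) -> int:
    """workers = max(1, floor((free_VRAM - reserve) / measured_per_worker))."""
    return max(1, math.floor((free_vram_gb - reserve_gb) / per_worker_gb))


def capability_tier(gpu_free_gb: Optional[float], turbo_gb: float, large_v3_gb: float) -> str:
    """Classify T0-T3 from measured footprints (assumed to include headroom)."""
    if gpu_free_gb is None or gpu_free_gb < turbo_gb:
        return "T0"                      # turbo does not fit on the GPU
    if gpu_free_gb < large_v3_gb:
        return "T1"                      # turbo fits, large-v3 does not
    if gpu_free_gb < turbo_gb + large_v3_gb:
        return "T2"                      # each fits alone, not together
    return "T3"                          # both can stay resident
```

For instance, `worker_count(24, 6, 5)` yields 3 workers, and a device with less free VRAM than the reserve still gets the minimum of one worker, which is why the OOM downgrade path remains necessary.
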
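+### 3.7 Job Lifecycle: Cancellation, Temp Files, Retention  *(resolves P0-3)*
+- **(a) Cooperative cancellation:** `DELETE /v1/jobs/{id}` → sets a cancel flag in Redis. The worker **consumes the faster-whisper segment generator and checks the flag at every segment boundary** (the only preemption point). The in-flight segment finishes, then the job stops → status `cancelled`. (An emergency hard-kill exists only as a worker-process termination option.)
+- **(b) Temp file lifetime:** ffmpeg-derived wav files are created in a tracked tempdir and **deleted in `finally` on every exit path (success/fail/cancel/OOM)**. The uploaded source is kept once transcription starts and is deleted on exit as well.
+- **(c) Realtime LocalAgreement contract:** set explicitly in configuration: `redecode_window` (e.g. the last 15 s of audio), the `confirmed_prefix` truncation rule, `retained_left_context` (e.g. 5 s), and commit at VAD silence boundaries. After a confirmed segment is emitted, the buffer is truncated (flat memory). Unit tests verify the buffer-truncation invariant.
+- **(d) Progress emission:** the worker computes `processed_sec/total_sec` while consuming the generator → **throttled `job.save_meta()`** (every N segments or ≥1 s). `total_sec` comes from the duration probe at ingest. `queue_position` = index in the lane queue.
+- **(e) Retention sweeper race:** `results/retention.py` sweeps **only terminal-state Jobs (completed/failed/cancelled)** on the 7d TTL. Results and temp artifacts held by {queued, processing} Jobs are left untouched.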
+
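Items (a), (b) and (d) of §3.7 interact inside one worker loop, which the following sketch illustrates. It is an assumption-laden outline rather than the planned `jobqueue/worker.py`: `run_job`, `is_cancelled` and `emit_progress` are hypothetical names, segments are plain tuples instead of faster-whisper objects, and in a real RQ worker `emit_progress` would update `job.meta` and call `job.save_meta()`.

```python
import shutil
import tempfile
import time
from pathlib import Path
from typing import Callable, Iterable, List, Tuple

Segment = Tuple[float, float, str]  # (start_sec, end_sec, text)


def run_job(
    segments: Iterable[Segment],
    total_sec: float,
    is_cancelled: Callable[[], bool],
    emit_progress: Callable[[float], None],
    min_interval: float = 1.0,
    clock: Callable[[], float] = time.monotonic,
):
    """Consume segments lazily; cancel only at segment boundaries."""
    workdir = Path(tempfile.mkdtemp(prefix="luke_scribe_"))  # tracked tempdir
    texts: List[str] = []
    last_emit = float("-inf")
    status = "completed"
    try:
        for _start, end, text in segments:
            texts.append(text)                 # the in-flight segment finishes
            now = clock()
            if now - last_emit >= min_interval:  # throttle progress writes
                emit_progress(min(end / total_sec, 1.0))
                last_emit = now
            if is_cancelled():                 # the only preemption point
                status = "cancelled"
                break
        return status, texts, workdir
    finally:
        # Runs on success, failure and cancellation alike.
        shutil.rmtree(workdir, ignore_errors=True)
```

Because the generator is consumed one segment at a time, the cancel latency is bounded by the decode time of a single segment, which is the trade-off Open Question 3 asks the user to confirm.
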
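+### 3.8 Security Boundary  *(resolves P1-7)*
+- **API key + scopes:** validate `X-API-Key` and enforce `ApiKey.scopes` (e.g. `transcribe`, `admin`). Key rotation/revocation is configurable.
+- **External egress control:** the LLM `external`/`openai` backends may send **only to endpoints on the config allowlist** (SSRF prevention), and only with **default off + explicit opt-in + one audit log entry per transmission (key id, endpoint, job id)**. PII masking before sending is optional.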
+
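A minimal sketch of the allowlist check in §3.8, assuming the allowlist is a set of `(host, port)` pairs and that only `https` is permitted; both are assumptions, as are the function names. It covers only the URL check and does not address DNS rebinding or redirects.

```python
from typing import Dict, Set, Tuple
from urllib.parse import urlsplit

Allowlist = Set[Tuple[str, int]]  # assumed shape: {(hostname, port), ...}


def is_egress_allowed(url: str, allowlist: Allowlist) -> bool:
    """Allow only https URLs whose exact host and port are allowlisted."""
    parts = urlsplit(url)
    if parts.scheme != "https" or parts.hostname is None:
        return False
    try:
        port = parts.port or 443     # .port raises ValueError if malformed
    except ValueError:
        return False
    return (parts.hostname, port) in allowlist


def audit_record(key_id: str, endpoint: str, job_id: str) -> Dict[str, str]:
    """One record per transmission, with the fields named in §3.8."""
    return {"key_id": key_id, "endpoint": endpoint, "job_id": job_id}
```

Matching on the parsed hostname rather than on a string prefix is what rejects look-alike URLs such as `https://api.example.com.evil.test/` or `https://api.example.com@evil.test/`.
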
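+### 3.9 Shared-GPU Accounting & Realtime Concurrency  *(v2.1: remaining consensus conditions)*
+- **(a) Non-blocking event loop:** the realtime turbo decode is a synchronous CTranslate2 call, so it is offloaded with **`await loop.run_in_executor(single_thread_executor, decode)`** (preserving the serialization of the single GPU lock). At the start of P3, verify whether CT2 releases the GIL; if it does not, split the realtime lane into a **separate decode process**. *(NEW-1a)*
+- **(b) Shared-GPU VRAM accounting:** `reserve = base_headroom + (realtime_enabled ? measured_realtime_vram : 0)`. When the realtime model resides on the same GPU, the batch worker-count formula (§3.6) must include its footprint → no oversubscription on a single GPU with realtime active. *(NEW-1b)*
+- **(c) Concurrent realtime sessions:** all sessions **share one turbo instance serially** (single GPU lock). A configurable **cap on concurrent WS sessions** applies, with reject/wait beyond it, plus a realtime-lane wait-time metric. AC-8 (≤5s) is stated as a guarantee "within the ≤N session limit". *(R3)*
+- **(d) Effective compute_type logging:** `vram_probe` and `/v1/system` report **requested vs effective compute_type** and warn on a mismatch (e.g. a silent downgrade of `int8_float16` on a Tesla T4). Part of the AC-2/3 contract. *(NEW-2/R2)*
+- **(e) RQ job_timeout:** set `job_timeout ≥ 4h (+margin)` at enqueue time (based on the duration probe). With RQ's default of 180s, long jobs would be killed after 3 minutes, violating AC-7 → it must be raised. *(R1/NEW-3)*
+- **(f) Phase Exit enforcement:** each Phase Exit is a **hard gate** (the next Phase may not start until it is met). Incomplete optional features (diarization/LLM/tunnel) are soft-allowed as "documented limitations". *(R4)*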
+
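The offloading pattern of §3.9a can be sketched as follows. `blocking_decode` is a stand-in that sleeps instead of calling CTranslate2, and all names are illustrative. A single-thread executor keeps decodes serialized, which plays the role of the single GPU lock in this sketch.

```python
import asyncio
import time
from concurrent.futures import ThreadPoolExecutor

# One worker thread: decodes run strictly one at a time.
_decode_executor = ThreadPoolExecutor(max_workers=1)


def blocking_decode(chunk: bytes) -> str:
    """Stand-in for the synchronous decode call (assumption: ~50 ms)."""
    time.sleep(0.05)
    return f"decoded {len(chunk)} bytes"


async def decode_async(chunk: bytes) -> str:
    loop = asyncio.get_running_loop()
    # The event loop stays free while the worker thread is busy.
    return await loop.run_in_executor(_decode_executor, blocking_decode, chunk)


async def demo():
    ticks = 0

    async def ticker():
        nonlocal ticks
        while True:                      # proves the loop keeps running
            await asyncio.sleep(0.005)
            ticks += 1

    task = asyncio.create_task(ticker())
    results = await asyncio.gather(decode_async(b"ab"), decode_async(b"abcd"))
    task.cancel()
    return results, ticks
```

Running `asyncio.run(demo())` shows the ticker advancing while both decodes are in flight. This only helps if the underlying call releases the GIL, which is exactly the check §3.9a schedules for the start of P3.
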
+---
+
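+### 3.10 CCG Review Follow-ups: Provisioning, WS, Shared Store, Execution Profiles  *(v2.2)*
+- **(a) Model provisioning (offline requirement dropped):** check for the model at startup → if missing, download from HF (internet allowed) → **on checksum/load failure, purge the cache and re-download once** → until ready, the API returns `503`/`status:"loading"`. ("Local execution" means no external STT API, not air-gapped operation; the spec's "offline" wording is dropped.)
+- **(b) WS init-frame auth:** the first message after connecting is `{type:"init", api_key, audio:{codec,sample_rate,channels}, options}`; the connection is closed if it does not arrive within 2 s. This works for browsers that cannot set headers and keeps the key out of URLs and logs. Codec and sample rate are negotiated in this frame too (absorbing Gemini's WS feedback).
+- **(c) Shared store (#1, production Docker only):** api and worker run in separate containers → **source input, derived wav, and results live on a shared volume/object store**, and **Job metadata lives in Redis**. Per-container local SQLite is forbidden. **Not applicable on Colab/single-process setups, which have a single disk.**
+- **(d) Two execution profiles (same code):** Colab/dev = pure Python (`run.sh`/`python -m ...`), **in-proc queue (no Redis)**, local disk, **cloudflared binary**. Production = Docker + Redis + worker + shared store. *(Colab cannot run Docker; this is a hard constraint.)*
+- **(e) Language/hotwords:** default `language="ko"` (per-request override). **Hotwords are optional** (recurring fixed terms are registered once, not guessed for every transcription); the default path is the ko anchor + model + rules post-processing, and the P1 bench measures whether hotwords are needed at all.
+- (Not adopted, deferred: webhook, Idempotency-Key, pagination, `410` on expiry.)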
+
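The init frame of §3.10b could be validated along these lines. The field names follow the frame shown above; the accepted codec set, the channel limit, and the names `InitFrame`, `InitError` and `parse_init_frame` are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, field
from typing import Any, Dict

ALLOWED_CODECS = {"pcm_s16le", "opus"}  # assumed set, not taken from the plan


class InitError(ValueError):
    """Raised when the first WS message is not a valid init frame."""


@dataclass
class InitFrame:
    api_key: str
    codec: str
    sample_rate: int
    channels: int
    options: Dict[str, Any] = field(default_factory=dict)


def parse_init_frame(raw: str) -> InitFrame:
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise InitError("init frame is not valid JSON") from exc
    if not isinstance(msg, dict) or msg.get("type") != "init":
        raise InitError('first message must have type "init"')
    api_key = msg.get("api_key")
    if not isinstance(api_key, str) or not api_key:
        raise InitError("api_key is required")
    audio = msg.get("audio")
    if not isinstance(audio, dict):
        raise InitError("audio block is required")
    codec = audio.get("codec")
    if not isinstance(codec, str) or codec not in ALLOWED_CODECS:
        raise InitError(f"unsupported codec: {codec!r}")
    sample_rate, channels = audio.get("sample_rate"), audio.get("channels")
    if not isinstance(sample_rate, int) or sample_rate <= 0:
        raise InitError("sample_rate must be a positive integer")
    if not isinstance(channels, int) or channels not in (1, 2):
        raise InitError("channels must be 1 or 2")
    options = msg.get("options") or {}
    if not isinstance(options, dict):
        raise InitError("options must be an object")
    return InitFrame(api_key, codec, sample_rate, channels, options)
```

The 2-second deadline is not shown here; one option would be to wrap the first receive call in `asyncio.wait_for(..., timeout=2)` and close the socket on `asyncio.TimeoutError` or `InitError`.
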
+---
+
+## 4. Implementation Steps (by phase, with file refs & exit criteria)
+
+### P1 — Core + Measurement Gate
+1. **Scaffolding**: `pyproject.toml` (extras), `config.py`, `run.sh`.
+2. **Device Manager + VRAM probe**: `devices/{manager.py, vram_probe.py, profile.py}`: detection + §3.6 measurement, precision, worker count, and Model-Fit branching. AC-2/3.
+3. **Engine + registry**: `engine/faster_whisper_engine.py` (transcribe: hotwords/initial_prompt/word_ts/vad), `model_registry.py` (rt=turbo/batch=large-v3, overridable). AC-4.
+4. **Audio ingest (streaming)**: `audio/ingest.py`: **pipe ffmpeg to a file** (never hold the full array in memory), duration/size probe, 4h/2GB→`413`. `audio/vad.py`: Silero VAD. AC-7/9.
+5. **CLI `detect`/`transcribe`/`bench`**: `bench` is **moved forward to P1**: on domain KO+EN clips, measure turbo vs large-v3 **R-WER + entity preservation rate + speed + measured VRAM** → decide the hybrid gate. AC-4/12.
+- **Exit:** single-file transcription succeeds on CPU and on one real GPU; `detect` prints measured VRAM, precision, and worker count; `bench` produces a model-delta report.
+
+### P2 — API + Queue
+6. **FastAPI + auth**: `api/{app.py, deps.py, schemas.py}`: API key + scopes (§3.8). AC-11.
+7. **RQ worker (SimpleWorker)**: `jobqueue/{broker.py, jobs.py, worker.py, cancel.py, inproc.py}`: §3.5 execution model, §3.7d progress, §3.7a cancellation, §3.6 OOM downgrade. AC-5/6.
+8. **Jobs routes**: `api/routes/jobs.py`: `POST /v1/jobs` (full queue→`429`), `GET` (queue_position/progress), `result?format=`, `DELETE` (cancel), `GET /v1/jobs`. AC-1/6.
+9. **Results + retention**: `results/{store.py, formats.py, retention.py}`: §3.7b/e, deletion of source and derived files, 7d TTL for terminal Jobs only. AC-11.
+10. **Docker**: `Dockerfile.{gpu,cpu}` + compose (api+redis+worker). **Pin the CT2/CUDA/cuDNN triple** (e.g. CUDA12+cuDNN9; document a CT2 downgrade path for older setups and Colab). `detect` exposes the runtime CUDA version. AC-12.
+11. **Admin**: `/health`, `/v1/system` (profile/workers/queue depth/available VRAM), `/v1/models`.
+- **Exit:** multi-job enqueue→progress→result; **the downgrade and failure paths demonstrated under forced OOM**; cancellation ends in `cancelled` with temp files confirmed deleted.
+
+### P3 — Realtime
+12. **Realtime**: `pipeline/realtime.py` (§3.7c contract: redecode_window/prefix/left-context/VAD), `api/routes/stream.py` (WS, partial/final/status, backpressure). AC-8.
+- **Exit:** partial results ≤5s + final stabilization; **flat memory over a 30-minute session** (within ±15%, no monotonic growth).
+
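The core of the LocalAgreement-2 policy behind step 12 is that a token is committed only once two consecutive hypotheses agree on it. The sketch below shows that rule alone, under simplifying assumptions: hypotheses are lists of word tokens, each new hypothesis covers only the audio after the already-confirmed prefix (the buffer has been truncated there), and timestamps, VAD boundaries and `retained_left_context` are left out. The class name is illustrative.

```python
from typing import List


def common_prefix(a: List[str], b: List[str]) -> List[str]:
    """Longest shared leading run of tokens."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return a[:n]


class LocalAgreement2:
    """Commit tokens that two consecutive hypotheses agree on."""

    def __init__(self) -> None:
        self.confirmed: List[str] = []   # stable output so far
        self._pending: List[str] = []    # unconfirmed tail of the last hypothesis

    def update(self, hypothesis: List[str]) -> List[str]:
        """Feed the latest hypothesis for the unconfirmed audio.

        Returns the tokens newly confirmed by this update.
        """
        newly = common_prefix(self._pending, hypothesis)
        self.confirmed.extend(newly)
        # Keep only what is still unconfirmed, so state stays bounded.
        self._pending = hypothesis[len(newly):]
        return newly
```

Feeding `["the", "API"]` and then `["the", "API", "uses"]` confirms `["the", "API"]` on the second call. The state kept between calls is just the unconfirmed tail, which is the property the buffer-truncation invariant test in §3.7c is meant to protect.
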
+### P4 — Output + Post-processing
+13. **Output options**: timestamps/word/SRT/VTT per request. AC-9.
+14. **Post-processing (glossary/rules/flag)**: `postprocess/{pipeline.py, glossary.py, rules.py, confidence.py}`. AC-10.
+- **Exit:** the entity-preservation gain measured via a glossary on/off diff; low-confidence flags attached.
+
+### P5 — Advanced
+15. **LLM correction (optional)**: `postprocess/llm.py`: local/external backends, §3.8 egress control, confidence-gated, off by default. AC-10.
+16. **Diarization (optional)**: `diarization/pyannote_diarizer.py` (HF token).
+17. **Colab tunnel**: `connectivity/tunnel.py`: **supervised as part of the API lifespan** (started and stopped together), re-prints the URL on rotation, states that the URL is ephemeral. AC-13.
+18. **Observability/bench extensions**: `observability/{logging,metrics}.py` (queue depth, worker utilization, RTF, OOM count), `bench` extensions.
+- **Exit:** external egress allowlist + audit log working; Colab `serve --tunnel cloudflare` returns 200 externally.
+
+---
+
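+## 5. Acceptance Criteria (quantified)
+
+Inherits spec AC-1 through AC-13 and **turns the ambiguous items into absolute criteria**:
+- **AC-2/3:** `detect` derives precision/worker count from measured VRAM, plus **automatic capability tier (T0-T3), offered models in `/v1/models`, and `model_used` reported** (§3.6); GTX 1050→int8/CPU (no large-v3 on GPU), Tesla T4→int8_float16, A100/H100→fp16.
+- **AC-4 (mixed-language, absolute):** domain entity terms (vLLM, API, FastAPI, Kubernetes, LLM, GPU, etc.) reach a **verbatim preservation rate ≥ 95%** (hotwords on, domain set), and domain **R-WER ≤ {P1 bench baseline}**. Secondary check: batch (v3) R-WER ≤ realtime (turbo) R-WER.
+- **AC-5/6:** with multiple concurrent jobs, each `GET` returns `queue_position` (N jobs ahead) and `progress %` (processed_sec/total_sec); full queue → `429`.
+- **AC-7 (memory, absolute):** over a 2h batch + a 30-minute WS session, **post-warmup peak RSS varies within ±15% with no monotonic growth**; no OOM; 4h/2GB→`413`.
+- **AC-8:** partial results ≤5s (within the ≤N concurrent session limit), final stabilization; REST/batch stay responsive under multi-session load (non-blocking event loop, §3.9a).
+- **AC-11:** source + derived wav absent after transcription; results expire after 7d; only terminal Jobs are swept.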
+
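AC-7's memory gate needs an operational definition before it can be automated. One possible reading is sketched below. It assumes the band is measured against the mean of the post-warmup samples and that "monotonic growth" means every consecutive sample is strictly larger than the previous one. Both choices are assumptions; the plan does not fix them.

```python
from statistics import mean
from typing import Sequence


def rss_is_flat(samples_mb: Sequence[float], warmup: int = 0, tolerance: float = 0.15) -> bool:
    """Check post-warmup RSS samples against the ±15% / no-growth gate."""
    steady = list(samples_mb[warmup:])
    if len(steady) < 2:
        raise ValueError("need at least two post-warmup samples")
    baseline = mean(steady)
    # Every sample must stay inside the tolerance band around the baseline.
    within_band = all(abs(s - baseline) <= tolerance * baseline for s in steady)
    # A strictly increasing series signals a leak even inside the band.
    strictly_growing = all(b > a for a, b in zip(steady, steady[1:]))
    return within_band and not strictly_growing
```

A series such as 1000, 1100, 1200, 1300 MB stays inside the band but fails the gate because it never stops growing, which is the leak pattern pre-mortem scenario 3 describes.
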
+---
+
+## 6. Risks and Mitigations
+| Risk | Mitigation |
+|------|-----------|
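+| CUDA fork failure | SimpleWorker/long-lived process, model-load-once, no fork (§3.5) |
+| OOM (concurrency) | Worker count from **measured VRAM** (§3.6), semaphore, fail after at most 2 downgrades |
+| turbo weak on mixed-language audio | P1 bench gate (§4-5), hybrid/hotwords/post-processing, model swap |
+| Realtime jitter/leaks | LocalAgreement contract (§3.7c), buffer-truncation invariant tests |
+| Cancel no-op / temp file leak | Cooperative cancellation + `finally` deletion (§3.7a/b) |
+| Retention race | Sweeper limited to terminal states (§3.7e) |
+| External LLM PII leak | allowlist + opt-in + audit log + masking (§3.8) |
+| CT2/cuDNN version mismatch | Dockerfile triple pin + downgrade path (P2-10) |
+| Redis SPOF | in-proc fallback (D1-C), health checks |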
+
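+## 7. Pre-mortem (deliberate: 3 scenarios, not overlapping with the risk table)
+1. **"The worker died immediately from fork-CUDA during P2 integration."** Cause: RQ's default fork. Prevention: enforce the §3.5 SimpleWorker, add a "worker boots on a real GPU" gate to the P2 Exit, and pre-verify with a worker-boot spike in P1.
+2. **"In a demo, 'vLLM' came out as '브이엘엘엠' (its Korean phonetic spelling) and trust was lost."** Cause: unverified model/hotwords. Prevention: **P1 bench gate** (entity preservation ≥95%), ship a default hotwords dictionary, and automatically choose v3 if the threshold is missed.
+3. **"A memory leak in a 30-minute WS session got the container OOM-killed."** Cause: ring buffer/hypotheses never truncated. Prevention: §3.7c truncation contract + buffer-invariant unit tests + the ±15% memory gate in the P3 Exit.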
+
+## 8. Expanded Test Plan (deliberate, with numeric gates)
+- **Unit:** Device decisions (cc/measured-VRAM→compute_type/workers/Model-Fit branching), OOM downgrade cap (≤2), LocalAgreement buffer-truncation invariant, formats (srt/vtt timecodes), rules normalization, retention (terminal only), options (413/429).
+- **Integration:** ffmpeg (streaming, not in-memory)→engine→result, Redis enqueue→SimpleWorker→progress (save_meta throttle)→status, cancel flag→`cancelled` + temp file deletion, glossary diff (entity preservation rate), egress allowlist block/allow.
+- **E2E:** file Job lifecycle (create→progress→result, source + derived files deleted), concurrent multi-job queue_position, WS session (partials ≤5s), Colab `serve --tunnel cloudflare` 200, forced-OOM downgrade path.
+- **Observability:** metrics (queue depth, worker utilization, RTF, OOM count, **realtime-lane wait time**), structured logs (correlated by job_id), `/health` and `/v1/system` contract (**verify requested vs effective compute_type reporting**), **2h batch/30-minute WS RSS flat within ±15%**, one external-egress audit log entry per transmission, **multi-session AC-8 (≤5s) + bounded responsiveness of concurrent `GET /v1/jobs`**, **jobs longer than 180s are not killed by job_timeout**, **shared-GPU (batch + realtime) VRAM non-oversubscription Model-Fit interaction test**.
+
+## 9. Verification Steps
+1. `uv sync && uv run python -m luke_scribe.cli detect` → measured VRAM, precision, worker count.
+2. `... bench --samples samples/ko_en/` → turbo vs v3 R-WER, entity preservation rate, VRAM → hybrid decision.
+3. `... transcribe samples/ko_en.wav` → confirm entity verbatim rate ≥95%.
+4. `docker compose up` → multiple `POST`/`GET` (progress/position), source + derived files absent after the result, `DELETE`→`cancelled`.
+5. Forced small-VRAM environment → demonstrate the downgrade (≤2) and failure paths.
+6. WS client for 30 minutes → partials ≤5s, RSS within ±15%.
+7. `pytest tests/` (unit/integration) + e2e smoke test.
+8. Colab → the `serve --tunnel cloudflare` URL returns 200 externally; egress allowlist block test.
+
+## 10. ADR (refined)
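+- **Decision:** single faster-whisper engine + (gated) hybrid models + Redis (RQ **SimpleWorker**, no-fork, model-load-once) for batch + dedicated WS realtime handler + **measurement-based DeviceProfile** + CLI/Docker (version-pinned).
+- **Drivers:** mixed-language accuracy, portability/auto-scaling (measured), concurrency/visibility.
+- **Alternatives:** Celery (overkill), in-proc (fallback/contingency), WhisperLive/Kit (reference/P5), multi-engine (premature), RQ default fork (incompatible with CUDA → rejected).
+- **Why chosen:** honors the user's confirmed decisions (hybrid/Redis/post-processing/retention/limits/diarize) and resolves the fork/VRAM/cancellation risks with explicit mechanisms.
+- **Consequences:** SimpleWorker heartbeat gap → covered by progress emits; realtime stabilization implemented in-house; no large-v3 batch on GPU for the GTX 1050; Redis dependency; on a shared GPU, combined batch + realtime VRAM accounting (§3.9b) and realtime decode offloading (§3.9a) are required.
+- **Follow-ups:** Redis HA, SimulStreaming (P5), more advanced egress masking. (The domain bench is no longer a follow-up; it is promoted to a **P1 gate**.)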
+
+---
+
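+## 11. Open Questions (user confirmation recommended)
+1. Does the real deployment need **multi-GPU workers**, or is it mostly a single T4/Colab? (If the latter, the worker-count formula carries less risk.)
+2. If turbo's KO entity preservation rate is ≥95% in the P1 bench, are you willing to **simplify to a single model** (saving VRAM and complexity)?
+3. Is **cooperative (segment-boundary)** cancellation enough, or is an immediate hard-kill needed?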
+
+---
+
+## 12. v2 Changelog (changes applied → review mapping)
+- **[P0-1]** New §3.5 Worker Execution Model: SimpleWorker/no-fork/model-load-once, separate batch and realtime lanes, in-proc contingency. (Arch P0-1, Critic F1)
+- **[P0-2]** §3.6 boot-time VRAM measurement + conservative constants (large-v3 fp16 10GB) + Model-Fit branching + `bench` **moved forward to P1** (fixes the phase-order inversion). (Arch P0-2/P1-4, Critic F2)
+- **[P0-3]** §3.7 cooperative cancellation + temp file deletion in `finally` + retention sweeper limited to terminal states. (Arch P0-3/P1-6, Critic F3)
+- **[P1]** §3.7d progress mechanism (save_meta throttle), §3.8 security (egress allowlist + audit + scopes), hybrid P1 gate. (Arch P1-4/5/7, Critic F4/F5)
+- **[P1-new]** §4 per-phase **Exit Criteria** + explicit in-proc contingency. (Critic F6)
+- **[P2]** §3.7c LocalAgreement contract, §4-4 ffmpeg streaming guarantee, §4-10/17 cloudflared lifecycle + CT2/CUDA/cuDNN pin. (Arch P2-8/9/10)
+- **[AC]** AC-4 absolute criteria (entity preservation ≥95% + R-WER baseline), AC-7 memory flat within ±15%, per-worker estimate = measured value, OOM downgrades ≤2, `429`/`413` separated. (4 Critic ambiguity items)
+- **[Risk]** OOM risk rewritten on the basis of measured constants (circularity removed); pre-mortem #1 made non-overlapping with the risk table, with its prevention wired into the P1/P2 gates.
+
+---
+
+## 13. v2.1 Changelog (remaining consensus conditions applied)
+- **[NEW-1a]** §3.9a realtime decode offloaded via `run_in_executor` (+ GIL verification / process-split fallback).
+- **[NEW-1b]** §3.9b `reserve` includes the realtime model's VRAM → prevents shared-GPU oversubscription.
+- **[R1/NEW-3]** §3.9e RQ `job_timeout ≥ 4h`.
+- **[NEW-2/R2]** §3.9d requested vs effective compute_type logging + AC-2/3 contract + observability test.
+- **[R3]** §3.9c cap on concurrent realtime sessions + wait-time metric + multi-session AC-8 test.
+- **[R4]** §3.9f Phase Exit enforcement (hard/soft) made explicit.
+
+## Consensus Outcome
+- **Architect (v2): APPROVE WITH CONDITIONS**: P0×3 confirmed resolved; remaining NEW-1(a/b)/2/3.
+- **Critic (v2): APPROVE WITH CONDITIONS**: CRITICAL×3 and MAJOR×3 confirmed resolved; NEW-1 verification + R1 through R4.
+- All remaining conditions are applied in **v2.1** → **consensus reached**. Iteration 2/5.
+
+---
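+## 14. v2.2 Changelog: CCG External Review (Codex/Gemini) Applied
+- **② → provisioning:** the "offline-only" requirement is dropped. Download allowed + re-download on corruption + `loading` status (§3.10a). Spec constraint wording updated.
+- **③ → capability tiers:** §3.6 rewritten as automatic tiering from T0 (CPU) to T3 (co-resident), T4 (multiple replicas) excluded, `model_used` always reported (silent model downgrade dropped).
+- **④ → WS init auth:** §3.10b first-message auth + codec negotiation. Hotwords demoted to optional (§3.10e).
+- **⑤ → default `ko`:** per-request override (§3.10e).
+- **#1 → shared store:** shared volume/object store for production Docker only (§3.10c); not applicable to Colab.
+- **Colab:** no Docker → pure Python, in-proc queue, cloudflared binary (§3.5, §3.10d).
+- Source: `.omc/artifacts/ask/{codex,gemini}-20260603-095739.md`. Not adopted (deferred): webhook, Idempotency-Key, pagination, `410`.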
+
+---
+*Consensus v2.2: `pending approval`. Execution (team/ralph/autopilot) proceeds only with the user's separate, explicit approval. Until then: no source changes, no commits, no execution-skill invocations.*
@@ -0,0 +1,34 @@
+{
+  "version": "1.0.0",
+  "lastScanned": 1780794206309,
+  "projectRoot": "/root/luke_scribe",
+  "techStack": {
+    "languages": [],
+    "frameworks": [],
+    "packageManager": null,
+    "runtime": null
+  },
+  "build": {
+    "buildCommand": null,
+    "testCommand": null,
+    "lintCommand": null,
+    "devCommand": null,
+    "scripts": {}
+  },
+  "conventions": {
+    "namingStyle": null,
+    "importStyle": null,
+    "testPattern": null,
+    "fileOrganization": null
+  },
+  "structure": {
+    "isMonorepo": false,
+    "workspaces": [],
+    "mainDirectories": [],
+    "gitBranches": null
+  },
+  "customNotes": [],
+  "directoryMap": {},
+  "hotPaths": [],
+  "userDirectives": []
+}
@@ -0,0 +1,401 @@
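+# Deep Interview Spec: luke_scribe, a Local STT Transcription API System
+
+> Internal (non-public) audio/video → text transcription API. Local model execution, automatic or manual GPU/CPU selection,
+> realtime (WebSocket) + batch (file/video), job queue and progress, mixed-language handling, post-processing, automatic exposure on Colab.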
+
+## Metadata
+- Interview ID: `di-luke-scribe-stt-20260602`
+- Rounds: 3 (scoring) + 1 for additional ideas + 1 to finalize open decisions
+- Final Ambiguity Score: **~10%** (threshold 20%; after finalizing 6 open decisions)
+- Type: **greenfield** (empty repository `luke_scribe`)
+- Generated: 2026-06-02
+- Threshold: 0.2 / Threshold Source: `default`
+- Initial Context Summarized: no
+- Status: **PASSED · decisions finalized · CCG external review applied (v2.2)**
+
+## Clarity Breakdown
+| Dimension | Score | Weight | Weighted |
+|-----------|-------|--------|----------|
+| Goal Clarity | 0.94 | 0.40 | 0.376 |
+| Constraint Clarity | 0.90 | 0.30 | 0.270 |
+| Success Criteria | 0.86 | 0.30 | 0.258 |
+| **Total Clarity** | | | **0.904** |
+| **Ambiguity** | | | **0.096 (~10%)** |
+
+---
+
+## Topology (confirmed components)
+
+| # | Component | Status | Description | Coverage |
+|---|-----------|--------|-------------|----------|
+| 1 | **Ingestion API** | active | Accepts realtime streams (WebSocket) and file/video uploads | AC-1, AC-7, AC-9 |
+| 2 | **Transcription Engine** | active | Local STT (faster-whisper), **hybrid: realtime=turbo / batch=large-v3** | AC-4 |
+| 3 | **Realtime Pipeline** | active | VAD, chunking, partial/final result streaming | AC-8 |
+| 4 | **Output / Results** | active | Per-request output options (txt/ts/word/diarize/SRT/VTT), result retention (7 days) | AC-9, AC-11 |
+| 5 | **Job Queue / Concurrency** (first-class) | active | Job abstraction, **durable Redis queue**, worker pool, priority lanes, queue_position and progress | AC-5, AC-6 |
+| 6 | **Device Manager** (cross-cutting) | active | GPU/CPU auto-detection → automatically derived precision, worker count, and concurrency; force flags | AC-2, AC-3 |
+| 7 | **Post-processing** | active | glossary/rules + (optional) LLM correction (configurable backend) + confidence flags | AC-10 |
+| 8 | **Connectivity / Tunnel** | active | Automatic external exposure for environments without a public IP, such as Colab (cloudflared etc.) | AC-13 |
+
+---
+
+## Goal
+
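+**A private API called by internal services that accepts realtime speech, recordings, mp3, and mp4 (and other video) and transcribes them to text with a locally executed STT model.** Realtime input is transcribed in near real time (partial results within 3-5 s). Precision and concurrency are chosen automatically for the detected hardware (GPU/CPU), and `auto | cpu | cuda` can be forced. Many jobs are processed concurrently or queued, and callers can query queue position and progress. The focus is Korean, but mixed Korean-English technical terms (e.g. "API", "vLLM") must be transcribed exactly instead of being mangled into phonetic Korean. **Accuracy-critical batch uses large-v3 and latency-critical realtime uses turbo** (hybrid).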
+
+---
+
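+## Constraints
+
+- **Local execution (STT).** No dependence on external STT APIs (Google/AWS, etc.); Whisper runs directly on our own hardware. Model weights are **downloaded automatically from HuggingFace when missing (internet allowed) and re-downloaded when corrupted**. *Air-gapped/offline operation is not a requirement.*
+- **Very wide hardware range:** dev = GTX 1050 (Pascal, 2-4GB), test = Colab/T4/L4/A100/H100. → fixed numeric settings are not viable; **automatic sizing is required**.
+- **Automatic precision selection:** compute capability ≥ 7.0 → fp16, Pascal (6.x) → int8, insufficient VRAM → CPU fallback, CPU → int8.
+- **Concurrency and worker count are derived automatically from the detected VRAM and cores** (overrides only).
+- **Hybrid models:** realtime=turbo, batch=large-v3 (both installed; `model` can be overridden).
+- **Language:** Korean first + auto-detection; accuracy on mixed Korean-English (code-switching) speech is a hard requirement.
+- **Realtime transport:** WebSocket. Target latency 3-5 s (lenient) → accuracy-first chunking is possible.
+- **Auth:** API key header (internal use).
+- **Queue:** **durable Redis queue (RQ, no-fork)** for restart tolerance and multiple workers. **Colab/dev use the in-process fallback (no Redis needed)**.
+- **Retention:** only results/metadata are kept, for **7 days** (configurable, auto-deleted); **uploaded source audio is deleted immediately after processing**.
+- **File limits:** every input is an **async Job by default**; absolute limit **4 hours / 2GB** (over the limit → `413`, configurable).
+- **Two deployment modes:** CLI (shell script) = dev/test/Colab; Docker (FastAPI/Python) = production (internal).
+- **CPU fallback is always supported.**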
+
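+## Non-Goals (explicitly out of scope)
+
+- Public API, billing, or multi-tenant SaaS features.
+- Training or fine-tuning our own STT model (off-the-shelf Whisper-family models are used).
+- Translation: outside the initial scope.
+- Frontend UI (API/CLI only).
+- Permanent archiving of source audio (deletion is the default).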
+
+---
+
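+## Acceptance Criteria (verifiable)
+
+- [ ] **AC-1** The same system transcribes both file (audio/video) and realtime (WebSocket) input.
+- [ ] **AC-2** `device=auto` runs as int8/CPU on a GTX 1050 and as fp16 on T4/L4/A100/H100, and the `cpu`/`cuda[:n]` force flags work.
+- [ ] **AC-3** Precision and worker count are derived automatically from the detected VRAM/compute capability, and can be overridden with `--workers`/`--compute-type`.
+- [ ] **AC-4** Mixed-language check: given the input *"그 API 서빙할 때 vLLM 쓰면 성능 대박이야"* (roughly, "performance is amazing if you use vLLM when serving that API"), "API" and "vLLM" are transcribed in English as-is (with hotwords applied). Confirm that the batch path (large-v3) is more accurate.
+- [ ] **AC-5** Multiple concurrent jobs are accepted, enqueued in the Redis queue, and processed concurrently, and new input keeps being accepted while jobs run.
+- [ ] **AC-6** Callers can query `queue_position` (N jobs ahead) and `progress` (processed length/total, %).
+- [ ] **AC-7** Long/large files are split into VAD segments, report progress, and keep memory usage constant. Over 4h/2GB → `413`.
+- [ ] **AC-8** Realtime partial results stream within 3-5 s and stabilize into the final result (turbo path).
+- [ ] **AC-9** Video files are transcribed after ffmpeg audio extraction, and output options (timestamps/word/diarize/formats) work per request.
+- [ ] **AC-10** Post-processing: glossary/rules work, as do LLM correction (backend `local`/`external` configurable, off by default, confidence-gated) and low-confidence span flags.
+- [ ] **AC-11** API key auth is enforced; source audio is deleted after transcription and only results are kept, for 7 days.
+- [ ] **AC-12** The system runs both via the CLI (`serve`/`transcribe`/`bench`/`detect`) and via Docker (GPU/CPU images + Redis).
+- [ ] **AC-13** On Colab, `--tunnel cloudflare` automatically issues a public URL that can be called from outside.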
+
+---
+
+## Architecture (detailed design)
+
+### System Overview
+
+```
+                            ┌──────────────────────────────────────────────────┐
+  Internal caller           │                 luke_scribe API                  │
+  (service/CLI)             │                                                  │
+        │                   │  ┌───────────────┐     ┌───────────────────────┐ │
+  REST  ├──── file/video ──▶│  │ Ingestion API │────▶│ Job Queue (Redis)     │ │
+  (HTTP)│                   │  │ (FastAPI)     │     │ - priority lanes      │ │
+        │                   │  │  - upload     │     │   (realtime/batch)    │ │
+  WS    ├──── live audio ──▶│  │  - WS stream  │     │ - queue_position      │ │
+        │                   │  │  - auth (API  │     │ - progress            │ │
+        │                   │  │    key)       │     │ - durable/restart-safe│ │
+        │                   │  └──────┬────────┘     └──────────┬────────────┘ │
+        │                   │         │ ffmpeg (video→audio)    │ dispatch     │
+        │                   │         ▼                         ▼              │
+        │                   │  ┌───────────────┐     ┌───────────────────────┐ │
+        │                   │  │ Realtime      │     │ Worker Pool           │ │
+        │                   │  │ Pipeline      │     │ (N = auto-sized)      │ │
+        │                   │  │ VAD→chunk→    │◀───▶│ ┌─────────────────┐   │ │
+        │                   │  │ partial/final │     │ │ Engine          │   │ │
+        │                   │  │ (turbo)       │     │ │ faster-whisper  │   │ │
+        │                   │  └──────┬────────┘     │ │ rt=turbo        │   │ │
+        │                   │         │              │ │ batch=large-v3  │   │ │
+        │                   │         ▼              │ └────────┬────────┘   │ │
+        │                   │  ┌───────────────┐     │          │            │ │
+        │                   │  │Post-processing│◀────┤          │ uses       │ │
+        │                   │  │glossary/rules │     └──────────┼────────────┘ │
+        │                   │  │+LLM (opt,plug)│                │              │
+        │                   │  │+conf flag     │     ┌──────────▼────────────┐ │
+        │                   │  └──────┬────────┘     │ Device Manager        │ │
+        │                   │         ▼              │ GPU/CPU detection →   │ │
+        │◀─ results/progress┤  ┌───────────────┐     │ fp16·int8·CPU /       │ │
+        │   (txt/srt/       │  │ Output/Results│     │ workers·concurrency   │ │
+        │    vtt/json)      │  │ store (7 days)│     └───────────────────────┘ │
+        │                   │  └───────────────┘                               │
+        │                   │  Connectivity/Tunnel (Colab→cloudflared, auto)   │
+        │                   └──────────────────────────────────────────────────┘
+```
+
+### 1) Ingestion API (Input/Intake)
+
+**REST (batch/file):**
+| Method | Path | Description |
+|--------|------|-------------|
+| `POST` | `/v1/jobs` | multipart: `file` (audio/video) + `options` (JSON). Returns `{job_id, status:"queued", queue_position}` |
+| `GET` | `/v1/jobs/{id}` | Status lookup: `queued` (queue_position, jobs_ahead) / `processing` (progress %, processed_sec/total_sec, eta) / `completed` / `failed` (error) |
+| `GET` | `/v1/jobs/{id}/result?format=txt\|srt\|vtt\|json` | Fetch the result (with format conversion) |
+| `DELETE` | `/v1/jobs/{id}` | Cancel the job |
+| `GET` | `/v1/jobs` | List/filter jobs |
+
+**WebSocket (real-time):** `WS /v1/stream`
+- 1) **init frame (first message = authentication):** `{type:"init", api_key, audio:{codec,sample_rate,channels}, options:{language:"ko", ...}}`. The server closes the connection if no valid init arrives within 2 seconds. *Browsers cannot set headers on the WS handshake, so authentication happens in this first message.*
+- 2) Client → continuous stream of audio chunks (PCM16/opus, etc.)
+- 3) Server → `{type:"partial", text, t0,t1}` (hypothesis) / `{type:"final", segment, start, end, words[]}` (committed) / `{type:"status", ...}`
+
+**Admin/observability:** `GET /health`, `GET /v1/system` (device profile, worker count, queue depth), `GET /v1/models`.
+
+**Authentication:** REST = `X-API-Key` header. **WS = `api_key` in the first `init` message** (works for browsers that cannot set headers; the key never ends up in URLs or logs). There is room to extend this with per-key scopes/usage tracking.
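The init-frame handshake described above can be sketched as a pure validation function. This is a minimal sketch under stated assumptions, not the project's implementation: `parse_init_frame`, `REQUIRED_AUDIO_FIELDS` and the shape of the returned dict are hypothetical names, and the 2-second timeout is only recorded as a constant because enforcing it belongs to the WebSocket server loop.

```python
import json
from typing import Optional

# Field names come from the init frame in the spec; the constant names are hypothetical.
REQUIRED_AUDIO_FIELDS = ("codec", "sample_rate", "channels")
INIT_TIMEOUT_SEC = 2.0  # spec: close if no valid init arrives within 2 seconds


def parse_init_frame(raw: str, valid_keys: set[str]) -> Optional[dict]:
    """Return the accepted session settings, or None if the socket should be closed."""
    try:
        msg = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(msg, dict) or msg.get("type") != "init":
        return None

    # Authentication happens here because browsers cannot set WS handshake headers.
    key = msg.get("api_key")
    if not isinstance(key, str) or key not in valid_keys:
        return None

    audio = msg.get("audio")
    if not isinstance(audio, dict) or any(f not in audio for f in REQUIRED_AUDIO_FIELDS):
        return None

    options = msg.get("options") or {}
    if not isinstance(options, dict):
        return None

    # Default language is "ko"; a value sent by the client overrides it.
    return {"audio": audio, "options": {"language": "ko", **options}}
```

A server would call this on the first text message and close the connection when it returns `None`, or when `INIT_TIMEOUT_SEC` elapses first.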
+
+**Request options schema (`options`):**
+```jsonc
+{
+  "language": "ko",                      // default ko (Korean first). "auto"|"en"|"ja"... overridable per request
+  "model": null,                         // null = per-path default (rt=turbo, batch=large-v3). Overridable
+  "device": "auto",                      // "auto" | "cpu" | "cuda" | "cuda:0"
+  "compute_type": null,                  // null = auto. "float16"|"int8"|"int8_float16"
+  "timestamps": true,                    // segment timestamps
+  "word_timestamps": false,              // word-level timestamps
+  "diarize": false,                      // speaker diarization (pyannote, optional, HF token)
+  "formats": ["json"],                   // ["txt","srt","vtt","json"]
+  "hotwords": [],                        // (optional) register recurring, fixed domain terms once. No per-request prediction needed; if empty, ko + model + post-processing handle it
+  "glossary_id": null,                   // reference to a stored domain glossary
+  "vad": true,                           // silence removal
+  "post_correction": {                   // stage control
+    "mode": "rules",                     // "none"|"glossary"|"rules"|"llm"
+    "backend": "local",                  // in llm mode: "local"|"openai"|"external"
+    "corrector_model": null              // per-backend model/endpoint
+  }
+}
+```
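One way to read the schema above is as a set of defaults that a request partially overrides. The sketch below mirrors the documented fields with stdlib dataclasses (the spec's stack lists pydantic v2, which would be the natural choice in the real service). `load_options`, `resolve_model` and `PATH_DEFAULT_MODEL` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass, field
from typing import Optional

# Per-path defaults from the spec: real-time uses turbo, batch uses large-v3.
PATH_DEFAULT_MODEL = {"realtime": "large-v3-turbo", "batch": "large-v3"}
ALLOWED_MODES = {"none", "glossary", "rules", "llm"}


@dataclass
class PostCorrection:
    mode: str = "rules"
    backend: str = "local"
    corrector_model: Optional[str] = None


@dataclass
class RequestOptions:
    language: str = "ko"
    model: Optional[str] = None
    device: str = "auto"
    compute_type: Optional[str] = None
    timestamps: bool = True
    word_timestamps: bool = False
    diarize: bool = False
    formats: list = field(default_factory=lambda: ["json"])
    hotwords: list = field(default_factory=list)
    glossary_id: Optional[str] = None
    vad: bool = True
    post_correction: PostCorrection = field(default_factory=PostCorrection)


def load_options(raw: dict) -> RequestOptions:
    """Build options from a request dict; unknown keys raise TypeError."""
    data = dict(raw)
    pc = PostCorrection(**data.pop("post_correction", {}))
    if pc.mode not in ALLOWED_MODES:
        raise ValueError(f"unsupported post_correction.mode: {pc.mode!r}")
    return RequestOptions(post_correction=pc, **data)


def resolve_model(opts: RequestOptions, path: str) -> str:
    """A null model falls back to the default for the given path."""
    return opts.model or PATH_DEFAULT_MODEL[path]
```

For example, `resolve_model(load_options({}), "batch")` falls back to the batch default, while an explicit `model` in the request is used on either path.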
+
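+### 2) Transcription Engine
+
+- **Runtime:** **faster-whisper (CTranslate2)**: roughly 4x faster than openai-whisper with lower memory use, GPU/CPU and fp16/int8 support, built-in Silero VAD, and batched inference support.
+- **Model strategy (decided): hybrid**
+  - **Real-time path = large-v3-turbo** (low latency; decoder slimmed down to 4 layers).
+  - **Batch path = large-v3** (better accuracy on code-switched/multilingual speech).
+  - Both models are installed and per-path defaults apply. Override at runtime via the `model` option or an environment variable.
+- **Code-switching handling (key point):**
+  1. Inject domain terms via `hotwords`/`initial_prompt` → prevents technical terms from being transliterated into Hangul.
+  2. A storable **Glossary** (domain dictionary) → reused via `glossary_id`.
+  3. (Optional) post-processing LLM correction to fix residual errors.
+- **Excluded:** distil-whisper, NVIDIA Parakeet/Canary (English-centric → unsuitable for Korean code-switching).
+
+> ✅ **Rationale:** On multilingual/code-switched speech, **large-v3 is more accurate than turbo** (turbo loses accuracy in some languages). Accuracy-critical batch work therefore runs on large-v3 and latency-critical real-time work on turbo (hybrid, decided). Promoting the real-time path to v3 can be re-evaluated later with WER benchmarks on domain samples.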
+
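+### 3) Realtime Pipeline
+
+- WebSocket audio frames → ring buffer → **Silero VAD** detects speech regions → chunk assembly → transcription → partial/final emission.
+- **Stabilization policy:** LocalAgreement (commit the part on which consecutive hypotheses agree) or AlignAtt (2025 SOTA). Because the 3-5 second latency budget is generous, use **large chunks + LocalAgreement-2** and favor accuracy. The default model on the real-time path is **turbo**.
+- **Reference implementations:** [WhisperLive](https://github.com/collabora/WhisperLive) (faster-whisper backend, WS, VAD), [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit) (AlignAtt), [whisper_streaming](https://github.com/ufal/whisper_streaming) (trending toward replacement by SimulStreaming). Either adopting an existing policy or reimplementing it is possible.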
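The LocalAgreement-2 idea (commit only what two consecutive hypotheses agree on) can be illustrated with a few lines of token-list bookkeeping. This is a simplified sketch, not the policy as implemented in any of the referenced projects: it assumes each hypothesis is the full token list for the current utterance re-decoded from its start, and it ignores buffer trimming and timestamps. `LocalAgreement2` and `common_prefix` are hypothetical names.

```python
def common_prefix(a: list[str], b: list[str]) -> list[str]:
    """Longest shared prefix of two token lists."""
    out = []
    for x, y in zip(a, b):
        if x != y:
            break
        out.append(x)
    return out


class LocalAgreement2:
    """Commit tokens once two consecutive hypotheses agree on them."""

    def __init__(self) -> None:
        self.committed: list[str] = []  # tokens already emitted as final
        self.prev: list[str] = []       # previous hypothesis

    def update(self, hypothesis: list[str]) -> tuple[list[str], list[str]]:
        """Return (newly_final, partial) for the latest hypothesis."""
        agreed = common_prefix(self.prev, hypothesis)
        # Only the part of the agreement beyond what is already committed is new;
        # committed tokens are never retracted in this sketch.
        newly_final = agreed[len(self.committed):]
        self.committed.extend(newly_final)
        self.prev = list(hypothesis)
        partial = hypothesis[len(self.committed):]
        return newly_final, partial
```

Feeding `["vLLM", "is"]` and then `["vLLM", "is", "fast"]` commits the first two tokens on the second call and leaves `"fast"` as the partial tail, which maps onto the `final` and `partial` server messages.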
+
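+### 4) Output / Results (Output and Retention)
+
+- Plain text by default, plus timestamps/words/speakers/subtitles depending on the request options.
+- Format conversion: `json` (source of truth) → `txt`/`srt`/`vtt`/structured-`json`.
+- **Retention (decided):** only results/metadata are kept for **7 days** (configurable, auto-deleted on expiry); **source audio is deleted right after transcription**.
+- Storage: local files/SQLite by default, extensible to S3/DB.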
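As an illustration of the `json` → `srt` conversion, the sketch below renders a list of segments as SubRip text. It assumes each segment is a dict with `start`/`end` in seconds and a `text` field, following the Segment entity in the ontology; `srt_timestamp` and `to_srt` are hypothetical helper names.

```python
def srt_timestamp(seconds: float) -> str:
    """Format seconds as the SRT timestamp HH:MM:SS,mmm."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"


def to_srt(segments: list[dict]) -> str:
    """Render segments as numbered SRT cues separated by blank lines."""
    blocks = []
    for i, seg in enumerate(segments, start=1):
        start = srt_timestamp(seg["start"])
        end = srt_timestamp(seg["end"])
        blocks.append(f"{i}\n{start} --> {end}\n{seg['text'].strip()}\n")
    return "\n".join(blocks)
```

WebVTT differs mainly in its `WEBVTT` header line and in using `.` instead of `,` before the milliseconds, so the same segment list can feed both writers.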
+
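+### 5) Job Queue / Concurrency
+
+- **Job abstraction:** files, real-time streams and video are all unified as Jobs. New Jobs keep being accepted while others are running.
+- **Priority lanes:** real-time sessions = low-latency priority / batch = throughput lane.
+- **Worker pool:** the worker count is computed by the Device Manager. Each worker holds a model instance bound to its device.
+- **Queue backend (decided):** a durable **Redis + RQ (no-fork)** queue from day one → restart resilience and support for multiple worker processes. *Celery's default prefork pool conflicts with CUDA, so no-fork workers are used.* (In-process fallback for Colab/development.)
+- **Progress:** long files are split into VAD segments → `progress = completed segments / total` (or processed audio seconds / total seconds). `queue_position` = index in the queue.
+- **Backpressure:** `429` when the maximum queue length is exceeded.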
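The queue semantics above (position as a queue index, progress as a ratio, `429` on overflow) can be sketched with an in-memory lane. This is a stand-in for illustration only: it does not use Redis or RQ, and `Lane`, `QueueFull` and `progress_percent` are hypothetical names.

```python
from collections import deque


class QueueFull(Exception):
    """Raised when a lane is at capacity; an API layer would map it to HTTP 429."""
    status_code = 429


class Lane:
    """One priority lane (e.g. realtime or batch), kept in memory."""

    def __init__(self, max_len: int) -> None:
        self.max_len = max_len
        self._jobs = deque()

    def enqueue(self, job_id: str) -> int:
        """Add a job and return its queue_position (number of jobs ahead)."""
        if len(self._jobs) >= self.max_len:
            raise QueueFull(f"lane holds {self.max_len} jobs already")
        self._jobs.append(job_id)
        return len(self._jobs) - 1

    def position(self, job_id: str) -> int:
        """queue_position = index of the job in the lane."""
        return self._jobs.index(job_id)

    def pop_next(self) -> str:
        """Hand the oldest job to a worker."""
        return self._jobs.popleft()


def progress_percent(processed_sec: float, total_sec: float) -> float:
    """progress = processed audio seconds / total seconds, as a percentage."""
    if total_sec <= 0:
        return 0.0
    return round(min(processed_sec / total_sec, 1.0) * 100, 1)
```

With a durable backend the same three operations would be backed by the queue store, so positions survive a restart.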
+
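+### 6) Device Manager (Automatic Capability Tiering: the Design's Central Axis)
+
+**Detection:** GPU presence, device name, compute capability, **VRAM (total/free)**, **system RAM**, **free disk space** (room for model downloads).
+
+**Automatic capability tier:** measurements taken at boot decide *which model is served where and how*. **This is not a silent model downgrade; the tier defines the set of "available models".**
+
+| Tier | What the hardware can handle | Behavior |
+|---|---|---|
+| **T0 · CPU** | Even turbo is too much for the GPU (or there is no GPU) | Run turbo on the **CPU** |
+| **T1 · turbo-GPU** | turbo fits on the GPU, large-v3 does not | turbo only, on GPU. **large-v3 is not offered** (batch also uses turbo) |
+| **T2 · swap** | large-v3 fits, but not **co-resident** with turbo | **Load/unload** models per call (one resident at a time; the most recently used model stays resident to minimize swaps) |
+| **T3 · co-resident** | turbo + large-v3 can be **loaded together** | Both resident → real-time (turbo) and batch (large-v3) are **processed concurrently** |
+| ~~T4 · multi-replica~~ | Several copies of a model loaded in parallel | **Excluded** (too complex) |
+
+- **Precision:** cc≥7.0 → fp16/int8_float16, Pascal (6.x, e.g. 1050) → int8, CPU → int8. (Finalized from the VRAM measured at boot.)
+- **Whether a model fits is decided by measurement at boot** (no reliance on static constants): turbo does not load → CPU (T0), large-v3 does not load → not offered (T1), both cannot load together → swap (T2), both load together → co-resident (T3).
+- **Disk guard:** check free space before downloading turbo (~1.6GB) / large-v3 (~3GB); fail with a clear error if it is insufficient.
+- **Transparency:** `/v1/system` and `/v1/models` report the current **tier and available models**, and results always include **`model_used`/`compute_type_used`** → no hidden downgrades.
+- **Overrides:** `--device auto|cpu|cuda:N`, `--compute-type`, `--model`, `--workers`.
+- **CLI `detect`** prints the tier, available models and recommended settings. **GTX 1050**: usually T1 (turbo-int8) or T0. **Tesla T4 through H100 GPUs** (the GPU model, not the excluded T4 tier): T3, full hybrid operation.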
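The tier decision above reduces to three boot-time load probes, and the precision rule to a compute-capability comparison. The sketch below assumes the probes have already been measured and are passed in as booleans; how they are measured is out of scope here. `LoadProbe`, `classify_tier` and `pick_compute_type` are hypothetical names.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LoadProbe:
    """Results of trying to load the models at boot (measured, not assumed)."""
    turbo_fits_gpu: bool
    large_v3_fits_gpu: bool
    both_fit_gpu: bool


def classify_tier(probe: LoadProbe) -> str:
    """Map boot-time load results to a capability tier T0..T3."""
    if not probe.turbo_fits_gpu:
        return "T0"  # turbo runs on CPU
    if not probe.large_v3_fits_gpu:
        return "T1"  # turbo on GPU only, large-v3 not offered
    if not probe.both_fit_gpu:
        return "T2"  # one model resident at a time, swap on demand
    return "T3"      # both models resident


def pick_compute_type(compute_capability: Optional[tuple[int, int]]) -> str:
    """Precision rule from the spec; None means no GPU (CPU path)."""
    if compute_capability is None:
        return "int8"
    if compute_capability >= (7, 0):
        return "float16"
    return "int8"  # Pascal (6.x), e.g. GTX 1050
```

For instance, a GTX 1050 (compute capability 6.1) where only turbo loads would come out as `T1` with `int8`, matching the expectation stated for `detect`.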
+
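+### 7) Post-processing (Correcting Transcription Errors)
+
+A sequential pipeline (stages controlled per request via `post_correction.mode`):
+1. **Glossary/Hotwords (optional):** register recurring, fixed domain terms once to bias decoding. *This is not a per-transcription prediction*; if unused, the ko language anchor + model + rules handle it.
+2. **Rule/Dictionary normalization (deterministic):** replace known misrecognitions with canonical terms, apply regexes, and fix abbreviation casing.
+3. **LLM correction (decided: configurable backend, off by default, confidence-gated):** only low-confidence/high-WER spans are corrected (Judge-Editor: keep high-confidence spans, rewrite only uncertain spans). **Pluggable backend**: `local` (small LLM, offline, private) or `openai`/`external` (OpenAI-compatible endpoint), with a configurable `corrector_model`. Disabled by default (protects weak hardware and prevents over-correction).
+4. **Confidence flagging:** assign a confidence score per segment and mark low-confidence spans → optional human review.
+
+> ⚠️ **Research basis:** LLM post-processing lowers WER substantially **when the input WER is high (>10%)**, but on already-accurate transcripts it risks **paraphrastic drift (over-correction)** → confidence gating is mandatory. Damage to domain proper nouns/technical terms is the key risk (measured with R-WER/EWER).
+> 🔒 **Privacy:** the `external`/`openai` backends send transcript text to an outside service, which may conflict with an internal-only policy → **the default is `local`**, and external backends require explicit opt-in.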
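The deterministic rule stage and the confidence gate can be sketched together. The rule entry, the `0.6` threshold and the function names below are illustrative assumptions, not values from the spec; segments are assumed to carry `text` and `confidence` as in the Segment entity.

```python
import re

# Illustrative rule: a Hangul transliteration mapped back to the canonical term.
EXAMPLE_RULES = [(re.compile("브이엘엘엠"), "vLLM")]
LOW_CONFIDENCE = 0.6  # hypothetical gating threshold


def apply_rules(text: str, rules) -> str:
    """Stage 2: deterministic replacements, applied in order."""
    for pattern, replacement in rules:
        text = pattern.sub(replacement, text)
    return text


def post_process(segments: list[dict], rules, threshold: float = LOW_CONFIDENCE) -> list[dict]:
    """Apply rules to every segment and flag low-confidence ones (stage 4)."""
    out = []
    for seg in segments:
        fixed = dict(seg, text=apply_rules(seg["text"], rules))
        fixed["low_confidence"] = seg["confidence"] < threshold
        out.append(fixed)
    return out


def llm_candidates(segments: list[dict]) -> list[int]:
    """Stage 3 gate: only flagged segments would be sent to the LLM corrector."""
    return [i for i, seg in enumerate(segments) if seg["low_confidence"]]
```

High-confidence segments never reach the corrector in this arrangement, which is the point of the gate: it limits both cost and the over-correction risk noted above.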
+
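+### 8) Connectivity / Tunnel (Automatic External Exposure on Colab)
+
+- **Automatic environment detection** (Colab/Kaggle/dev) → start a tunnel when the option is set.
+- **Default: cloudflared Quick Tunnel**: `https://<random>.trycloudflare.com`, **no account or domain needed**, temporary URL, zero configuration. (`--tunnel cloudflare`)
+- **Alternative: ngrok**: requires an authtoken, the free plan changes the URL on restart, and it offers request inspection. (`--tunnel ngrok --ngrok-token ...`)
+- **Stable domain:** a named Cloudflare Tunnel (requires a CF account + domain).
+- **Production/real IP:** `--tunnel none`, bind to the host IP.
+- On startup, print the public URL and the API key.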
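To print the public URL on startup, the launcher has to pick it out of the tunnel process output. The sketch below only does the text matching against the `https://<random>.trycloudflare.com` shape given above; it assumes the URL appears somewhere in the captured output lines and makes no claim about cloudflared's exact log format. `find_public_url` is a hypothetical helper.

```python
import re
from typing import Iterable, Optional

# Matches the Quick Tunnel URL shape named in the spec.
QUICK_TUNNEL_URL = re.compile(r"https://[a-z0-9-]+\.trycloudflare\.com")


def find_public_url(lines: Iterable[str]) -> Optional[str]:
    """Return the first Quick Tunnel URL found in the given output lines."""
    for line in lines:
        match = QUICK_TUNNEL_URL.search(line)
        if match:
            return match.group(0)
    return None
```

Because `lines` can be any iterable, the same function works on a captured log or on a stream read line by line from the tunnel subprocess.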
+
+### Deployment: One Codebase, Two Run Profiles
+
+| | **Colab / development** | **Production (internal)** |
+|---|---|---|
+| Execution | **Plain Python / python invoked directly via `run.sh`** (Docker ❌) | Docker + `docker-compose` |
+| Queue | **in-proc (no Redis needed)** | Redis + separate worker |
+| Storage | A single local disk (no sharing issues) | **Shared volume/object store** (api↔worker) |
+| Weights | Download and cache (e.g. Drive) | Download and cache (download if missing) |
+| External exposure | Run the **cloudflared binary** | Real IP |
+
+- **CLI (`run.sh` + `cli.py`)**: `serve` (API + optional tunnel) / `transcribe <file>` / `bench` (model and tier benchmark) / `detect` (tier and profile). **Colab cannot run Docker → this path is mandatory there.**
+- **Docker (production)**: GPU image (`nvidia/cuda`) + CPU (slim). compose: API + **Redis** + worker (+ optional LLM). **Source inputs, derived wav files and results live in a shared store** (so nothing breaks at container boundaries).
+- **Model provisioning (common):** check for the model at startup → download if missing → re-download if corrupted (load failure/checksum) → report `status:"loading"` until ready.
+- **Configuration:** env + `.env`/yaml: model (rt/batch), device, workers, api_keys + scopes, retention_days, tunnel, redis_url (production), corrector + allowlist.
+
+### Tech Stack (Proposed)
+
+Python 3.11+, **FastAPI** + uvicorn, **faster-whisper (CTranslate2)** (turbo + large-v3), **ffmpeg** (video → 16kHz mono), **Silero VAD** (built into faster-whisper), **Redis + RQ (no-fork)** (durable queue), pydantic v2, **LLM correction backend** (local: llama.cpp/transformers · external: OpenAI-compatible client), (optional) **pyannote.audio** (diarization, HF token), **cloudflared**/pyngrok (tunnel), loguru/structlog (logging), prometheus-client (metrics, optional).
+
+---
+
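+## Assumptions Exposed & Resolved
+
+| Assumption | Challenge | Resolution |
+|------------|-----------|------------|
+| "One job at a time is enough" | What about concurrent/waiting jobs and input added mid-run? | Queue promoted to first-class, durable in Redis, priority lanes + queue_position/progress |
+| "Pick an engine and you're done" | Code-switching accuracy conflicts with speed | **Hybrid** (real-time turbo / batch large-v3) + hotwords/post-processing |
+| "Concurrency must be a fixed number" | The range from 1050 to H100 is too wide | The Device Manager derives precision and worker count from VRAM/CC |
+| "The system fixes the output format" | It can differ per call | Output options are passed per request via `options` |
+| "Real-time needs the lowest possible latency" | 3-5 seconds is acceptable | Large chunks + stabilization policy, favoring accuracy |
+| "Colab is also called by IP" | Colab has no public IP | Automatic exposure via cloudflared Quick Tunnel |
+| "Seeing the raw transcript is enough" | What about fixing misrecognitions/typos? | Post-processing pipeline (glossary→rules→LLM (opt, configurable backend)→flag) |
+| "Post-processing is local only" | Are external LLMs allowed? | Configurable local/external backend; external is a privacy opt-in |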
+
+---
+
+## Ontology (Key Entities)
+
+| Entity | Type | Fields | Relationships |
+|--------|------|--------|---------------|
+| Job | core | id, type(file/stream/video), status, queue_position, progress, options, created_at | has many Segment, produces TranscriptResult |
+| AudioInput | core | source(stream/file/video), codec, duration, size | belongs to Job |
+| Engine | core | runtime(faster-whisper), model(turbo/large-v3), compute_type | used by Worker |
+| Device | core | kind(gpu/cpu), name, vram_total/free, compute_capability | profiled by DeviceManager |
+| DeviceProfile | supporting | precision, max_workers, fallback | derived from Device |
+| Worker | core | id, device, model_instance, busy | consumes Queue(Redis), runs Engine |
+| Queue | core | lane(realtime/batch), depth, backend(redis) | holds Job |
+| Segment | supporting | index, start, end, text, words[], confidence | belongs to Job |
+| TranscriptResult | core | text, segments[], formats, language | belongs to Job |
+| RequestOptions | supporting | language, model, device, formats, hotwords, post_correction | configures Job |
+| Glossary | supporting | id, terms[] | applied to Engine/Post-processing |
+| PostProcessor | supporting | mode, backend(local/external), corrector_model, stages | transforms TranscriptResult |
+| Session(realtime) | core | ws_conn, buffer, options | produces partial/final Segment |
+| ApiKey | supporting | key, scopes, usage | authorizes Job |
+| RetentionPolicy | supporting | result_ttl=7d, delete_source=true | governs Output |
+| Tunnel | supporting | provider(cloudflare/ngrok/none), public_url | exposes API |
+
+## Ontology Convergence
+
+| Round | Entity Count | New | Changed | Stable | Stability Ratio |
+|-------|-------------|-----|---------|--------|----------------|
+| 1 | 10 | 10 | - | - | N/A |
+| 2 | 13 | 3 | 0 | 10 | ~77% |
+| 3 | 14 | 1 | 0 | 13 | ~92% |
+| Additions incorporated | 16 | 2 (PostProcessor, Tunnel) | 0 | 14 | ~88% |
+| Decisions finalized | 16 | 0 | 1 (Queue→Redis) | 16 | ~100% (converged) |
+
+---
+
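+## Resolved Decisions
+
+| # | Decision | Resolution |
+|---|----------|------------|
+| 1 | Model strategy | **Hybrid**: real-time = turbo (low latency), batch = large-v3 (code-switching accuracy). The model setting can be overridden. |
+| 2 | Queue durability | **Durable Redis queue (RQ, no-fork)** from day one: restart resilience and multiple workers. In-process fallback for Colab/development. |
+| 3 | LLM post-processing | Included, with a **configurable backend** (local small LLM / OpenAI-compatible external), off by default and confidence-gated. External is a privacy opt-in. |
+| 4 | Result retention | **7 days** (configurable, auto-deleted). Source audio is deleted right after transcription. |
+| 5 | File limits | Every input is an **asynchronous Job by default**, with hard caps of **4 hours / 2GB** (`413` when exceeded, configurable). |
+| 6 | Speaker diarization | **Included as an option** (pyannote, HF token), off by default, enabled per request with `diarize=true`. |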
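Decision 5 amounts to a simple admission check before a Job is enqueued. A minimal sketch, assuming "2GB" is read as 2 GiB and that duration and size are known at upload time; `admission_status` and the constant names are hypothetical, and both caps would come from configuration in practice.

```python
from typing import Optional

MAX_DURATION_SEC = 4 * 3600      # 4-hour cap from the decision table
MAX_SIZE_BYTES = 2 * 1024 ** 3   # assumption: "2GB" interpreted as 2 GiB


def admission_status(duration_sec: float, size_bytes: int) -> Optional[int]:
    """Return 413 if either cap is exceeded, else None (the job may be enqueued)."""
    if duration_sec > MAX_DURATION_SEC or size_bytes > MAX_SIZE_BYTES:
        return 413
    return None
```

Inputs exactly at a cap are accepted in this sketch; only values strictly above it are rejected.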
+
+---
+
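+## CCG External Review Incorporated (v2.2)
+
+Updated from the external advisor (Codex/Gemini) review plus user confirmation:
+- **Model provisioning:** downloading from the internet is fine (this is not an air-gapped setup). Missing → download, corrupted → re-download, `status:"loading"` while loading. (The old "works offline" wording is dropped.)
+- **Automatic capability tier (§6):** GPU VRAM, RAM and disk decide the tier automatically, from **T0 (CPU) to T3 (turbo + large-v3 co-resident)**; T4 (multi-replica) is excluded. Results always include **`model_used`** (no silent downgrade).
+- **Default language `ko`** with per-request override. **Hotwords are optional** (not a per-transcription prediction).
+- **WS authentication = first `init` message** (api_key + audio format + options).
+- **Two deployment profiles:** Colab/development = plain Python, in-proc queue, cloudflared binary / production = Docker + Redis + shared store.
+- (Not adopted, to be revisited: webhook callbacks, Idempotency-Key, list pagination, `410` for expired results.)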
+
+---
+
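+## Implementation Roadmap (Proposed Phases)
+
+| Phase | Scope | Deliverable |
+|-------|-------|-------------|
+| **P1 Core** | faster-whisper integration (turbo + large-v3), Device Manager auto-detection, CLI `transcribe`/`detect`, synchronous file transcription, ffmpeg video extraction | One-shot transcription works (including 1050/CPU) |
+| **P2 API+Queue** | FastAPI, **durable Redis queue** and worker pool, status/progress API, API Key, result retention (7 days) and source deletion, Docker (GPU/CPU/Redis) | Asynchronous batch API |
+| **P3 Realtime** | WebSocket streaming, VAD chunks, LocalAgreement partial/final (turbo) | Real-time transcription |
+| **P4 Output+Post** | timestamps/word/SRT/VTT, glossary/rules post-processing, confidence flags | Rich output + first-pass post-processing |
+| **P5 Advanced** | LLM post-processing backend (local/external, optional), diarization (pyannote, optional), automatic cloudflared on Colab, metrics/monitoring, `bench` | Operations and advanced features |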
+
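+## Risks
+
+- **Real-time limits on a weak GPU (1050):** even turbo is a stretch on Pascal/2GB → int8/CPU fallback; for real-time use, a Tesla T4 or better is effectively recommended (to be documented).
+- **turbo accuracy on code-switched speech:** compensate with hotwords/post-processing but verify with a domain benchmark; promote the real-time path to v3 if needed.
+- **LLM post-processing over-correction:** confidence gating is mandatory.
+- **Privacy of external LLM post-processing:** the `external`/`openai` backends send transcript text outside → the internal-only policy needs review (default is local).
+- **Redis dependency:** the durable queue depends on Redis → single point of failure. Mitigated with HA, or with the in-process fallback for one-off runs.
+- **GPU memory concurrency:** per-worker VRAM estimates can be off → conservative sizing + retry/downgrade on OOM.
+- **Temporary Quick Tunnel URL:** trycloudflare URLs are not permanent → use a named tunnel/ngrok when stability is needed.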
+
+---
+
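+## Interview Transcript (Summary)
+
+<details>
+<summary>Q&A (3 rounds + additional ideas + decisions finalized)</summary>
+
+**Round 0 (Topology):** 4 components + requirements for queueing/concurrency, input added mid-run, queue position, progress and large-file handling → queue promoted to first-class.
+
+**Round 1 (100%→~40%)**
+- Engine: code-switching support and speed both matter; candidate large-v3; recommendation requested.
+- Language: Korean + auto-detect; accurate transcription of code-switched speech ("vLLM 쓰면 성능 대박") is mandatory.
+- Real-time transport: **WebSocket**.
+
+**Round 2 (~40%→~26%)**
+- Engine: turbo only (speed) → later settled as hybrid.
+- Hardware: GTX 1050, Colab, T4/L4/A100/H100 → **auto-detection and automatic capacity sizing**.
+- Concurrency: **auto-tuned from the hardware**.
+- Output: **per-request options**.
+
+**Round 3 (~26%→~14%, PASSED)**
+- Deployment: **CLI + Docker/FastAPI**. Auth: **API Key**. Retention: **results only, source deleted**. Real-time latency: **3-5 seconds is acceptable**.
+
+**Additional ideas (incorporated)**
+- **Post-processing** of transcription errors → Post-processing (glossary→rules→LLM(opt)→flag).
+- **Automatic external exposure** for environments without a public IP, such as Colab → cloudflared Quick Tunnel by default.
+
+**Open decisions finalized (~14%→~10%)**
+- Model = hybrid, queue = durable Redis, LLM post-processing = configurable backend (local/external), 7-day retention, 4h/2GB cap, diarization optional (off).
+
+</details>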
+
+---
+*Generated by deep-interview · threshold 20% (default) · ambiguity ~10% (6 open decisions finalized) · PASSED*