luke_scribe/.omc/artifacts/ask/codex-20260603-095739.md

Reading prompt from stdin...
OpenAI Codex v0.136.0
--------
workdir: /root/luke_scribe
model: gpt-5.4
provider: openai
approval: never
sandbox: danger-full-access
reasoning effort: xhigh
reasoning summaries: detailed
session id: 019e8afc-736d-73e1-b4c5-d655d8391b87
--------
user
You are a senior backend + ML-systems architect performing an EXTERNAL, read-only review. Do NOT edit any files. Read these two design documents in the current repo:

1. .omc/specs/deep-interview-luke-scribe-stt-api.md  (requirements spec)
2. .omc/plans/consensus-luke-scribe-stt-api.md  (implementation plan v2.1; already passed an internal Architect+Critic consensus)

System under design: a greenfield INTERNAL (private) speech-to-text transcription API.
- Python 3.11 + FastAPI + faster-whisper (CTranslate2).
- Hybrid models: realtime = large-v3-turbo, batch = large-v3 (gated by a P1 WER bench).
- Device Manager auto-detects GPU/CPU (GTX 1050 -> H100), boot-time VRAM probe, precision auto (fp16/int8/int8_float16/CPU), worker-count formula.
- Redis (RQ SimpleWorker, no-fork, model-loaded-once) durable queue for batch; a separate long-lived in-process WebSocket handler for realtime with run_in_executor offload + single GPU lock.
- Cooperative cancellation at segment boundaries; temp-file finally cleanup; retention sweeper on terminal jobs only; results kept 7d, original audio deleted immediately.
- API-key auth (+scopes); optional post-correction (glossary -> rules -> LLM local/external with egress allowlist).
- CLI (dev/Colab) + Docker (prod) + Colab cloudflared auto-tunnel.

Two internal reviewers already raised and the plan ALREADY addresses: RQ fork/CUDA incompatibility, wrong VRAM constant (now boot-probed; large-v3 fp16 ~10GB), in-flight cancel + temp lifecycle + retention race, AC-6 progress via throttled job.save_meta(), hybrid-before-bench (bench moved to P1), event-loop blocking of the realtime handler (run_in_executor), realtime VRAM accounting in `reserve`, RQ job_timeout >= 4h, effective-compute_type logging, realtime session cap. Do NOT just restate these as resolved.

Your job: a FRESH external pass. Output a PRIORITIZED review of issues that are STILL present or NEWLY spotted. Specifically assess:
1. Architectural correctness bugs or unsound decisions remaining.
2. faster-whisper / CTranslate2 / GPU / VRAM / concurrency risks still under-handled (CT2 SimpleWorker semantics, multi-model residency, batched inference vs sequential, VAD, word-timestamp alignment cost, Pascal/compute-cap support, model download/caching/cold-start, CUDA/cuDNN pinning).
3. Security gaps (API key storage/rotation, external egress/SSRF, audio PII, Colab tunnel exposing an internal API publicly).
4. Test-plan / acceptance-criteria adequacy gaps.
5. Anything in the API/queue/realtime design that will bite during implementation.

End with: the TOP 3 highest-priority changes you would require before writing code, and an overall verdict (SOUND-TO-BUILD or NEEDS-CHANGES). Cite section numbers (e.g., section 3.5, AC-7). Be concise (~500-700 words). Read-only; produce only the review text.
2026-06-03T00:57:48.437823Z ERROR codex_core_skills::loader: failed to stat skills path /root/.codex/.tmp/plugins/plugins/figma/skills/figma-generate-design/agents: No such file or directory (os error 2)
2026-06-03T00:57:48.438347Z ERROR codex_core_skills::loader: failed to stat skills path /root/.codex/.tmp/plugins/plugins/figma/skills/figma-generate-design/LICENSE.txt: No such file or directory (os error 2)
2026-06-03T00:57:48.442618Z ERROR codex_core_skills::loader: failed to read skills dir /root/.codex/.tmp/plugins/plugins/figma/skills/figma-create-design-system-rules: No such file or directory (os error 2)
2026-06-03T00:57:48.447216Z ERROR codex_core_skills::loader: failed to read skills dir /root/.codex/.tmp/plugins/plugins/figma/skills/figma-code-connect-components/agents: No such file or directory (os error 2)
2026-06-03T00:57:48.447570Z ERROR codex_core_skills::loader: failed to read skills dir /root/.codex/.tmp/plugins/plugins/figma/skills/figma-code-connect-components/scripts: No such file or directory (os error 2)
2026-06-03T00:57:48.447853Z ERROR codex_core_skills::loader: failed to read skills dir /root/.codex/.tmp/plugins/plugins/figma/skills/figma-code-connect-components/references: No such file or directory (os error 2)
2026-06-03T00:57:48.448211Z ERROR codex_core_skills::loader: failed to read skills dir /root/.codex/.tmp/plugins/plugins/figma/skills/figma-generate-library/agents: No such file or directory (os error 2)
2026-06-03T00:57:48.460245Z ERROR codex_core_skills::loader: failed to read skills dir /root/.codex/.tmp/plugins/plugins/figma/skills/figma-implement-design/agents: No such file or directory (os error 2)
2026-06-03T00:57:48.460578Z ERROR codex_core_skills::loader: failed to read skills dir /root/.codex/.tmp/plugins/plugins/figma/skills/figma-implement-design/scripts: No such file or directory (os error 2)
2026-06-03T00:57:48.460789Z ERROR codex_core_skills::loader: failed to read skills dir /root/.codex/.tmp/plugins/plugins/figma/skills/figma-implement-design/references: No such file or directory (os error 2)
2026-06-03T00:57:48.461138Z ERROR codex_core_skills::loader: failed to read skills dir /root/.codex/.tmp/plugins/plugins/figma/skills/figma-use/agents: No such file or directory (os error 2)
**Figuring out design document reviews**

I need to perform a read-only review of two design documents and send commentary before doing substantial work. The user has asked for an external read-only review, which makes me think of the "review" skill. However, since this is a design doc review, I wonder if I should use the "plan-eng-review" skill instead. The instructions say to use the skill when it matches the task description, so I’ll go ahead and open the skill file to follow the proper workflow.
**Planning document review process**

I need to announce using the skill and then read the documents. Since it's read-only, I'm considering using tools like `rg`, `sed`, or `cat`. The developer suggests using `multi_tool_use.parallel` for reading files, rather than chaining commands. I think I’ll start by checking the skill, and I might use `exec_command cat` for reading. It might be helpful to update the plan but isn't necessary. I’ll inspect the relevant files and the skill file, using parallel commands for efficiency, while ensuring that I can reference necessary line and section numbers.
**Analyzing and updating the plan**

I really need to analyze the plan thoroughly before moving forward. It's important to ensure everything is well-understood and any potential issues are identified. After that, I will send a commentary update. This will help keep everyone in the loop and ensure that my insights on the plan are communicated effectively. I want to make sure I’m being thorough so that nothing important gets overlooked.
codex
Reviewing the requirements spec and implementation plan as a read-only architecture pass. I’m loading the plan-review skill guidance first, then I’ll read both documents with line numbers and extract only remaining or new issues.
exec
/bin/bash -lc "sed -n '1,220p' /root/.claude/skills/gstack/.agents/skills/gstack-plan-eng-review/SKILL.md" in /root/luke_scribe
 succeeded in 0ms:
---
name: plan-eng-review
description: |
  Eng manager-mode plan review. Lock in the execution plan — architecture,
  data flow, diagrams, edge cases, test coverage, performance. Walks through
  issues interactively with opinionated recommendations. Use when asked to
  "review the architecture", "engineering review", or "lock in the plan".
  Proactively suggest when the user has a plan or design doc and is about to
  start coding — to catch architecture issues before implementation. (gstack)
  Voice triggers (speech-to-text aliases): "tech review", "technical review", "plan engineering review".
---
<!-- AUTO-GENERATED from SKILL.md.tmpl — do not edit directly -->
<!-- Regenerate: bun run gen:skill-docs -->

## Preamble (run first)

```bash
_ROOT=$(git rev-parse --show-toplevel 2>/dev/null)
GSTACK_ROOT="$HOME/.codex/skills/gstack"
[ -n "$_ROOT" ] && [ -d "$_ROOT/.agents/skills/gstack" ] && GSTACK_ROOT="$_ROOT/.agents/skills/gstack"
GSTACK_BIN="$GSTACK_ROOT/bin"
GSTACK_BROWSE="$GSTACK_ROOT/browse/dist"
GSTACK_DESIGN="$GSTACK_ROOT/design/dist"
_UPD=$($GSTACK_BIN/gstack-update-check 2>/dev/null || .agents/skills/gstack/bin/gstack-update-check 2>/dev/null || true)
[ -n "$_UPD" ] && echo "$_UPD" || true
mkdir -p ~/.gstack/sessions
touch ~/.gstack/sessions/"$PPID"
_SESSIONS=$(find ~/.gstack/sessions -mmin -120 -type f 2>/dev/null | wc -l | tr -d ' ')
find ~/.gstack/sessions -mmin +120 -type f -exec rm {} + 2>/dev/null || true
_PROACTIVE=$($GSTACK_BIN/gstack-config get proactive 2>/dev/null || echo "true")
_PROACTIVE_PROMPTED=$([ -f ~/.gstack/.proactive-prompted ] && echo "yes" || echo "no")
_BRANCH=$(git branch --show-current 2>/dev/null || echo "unknown")
echo "BRANCH: $_BRANCH"
_SKILL_PREFIX=$($GSTACK_BIN/gstack-config get skill_prefix 2>/dev/null || echo "false")
echo "PROACTIVE: $_PROACTIVE"
echo "PROACTIVE_PROMPTED: $_PROACTIVE_PROMPTED"
echo "SKILL_PREFIX: $_SKILL_PREFIX"
source <($GSTACK_BIN/gstack-repo-mode 2>/dev/null) || true
REPO_MODE=${REPO_MODE:-unknown}
echo "REPO_MODE: $REPO_MODE"
_LAKE_SEEN=$([ -f ~/.gstack/.completeness-intro-seen ] && echo "yes" || echo "no")
echo "LAKE_INTRO: $_LAKE_SEEN"
_TEL=$($GSTACK_BIN/gstack-config get telemetry 2>/dev/null || true)
_TEL_PROMPTED=$([ -f ~/.gstack/.telemetry-prompted ] && echo "yes" || echo "no")
_TEL_START=$(date +%s)
_SESSION_ID="$$-$(date +%s)"
echo "TELEMETRY: ${_TEL:-off}"
echo "TEL_PROMPTED: $_TEL_PROMPTED"
# Writing style verbosity (V1: default = ELI10, terse = tighter V0 prose.
# Read on every skill run so terse mode takes effect without a restart.)
_EXPLAIN_LEVEL=$($GSTACK_BIN/gstack-config get explain_level 2>/dev/null || echo "default")
if [ "$_EXPLAIN_LEVEL" != "default" ] && [ "$_EXPLAIN_LEVEL" != "terse" ]; then _EXPLAIN_LEVEL="default"; fi
echo "EXPLAIN_LEVEL: $_EXPLAIN_LEVEL"
# Question tuning (see /plan-tune). Observational only in V1.
_QUESTION_TUNING=$($GSTACK_BIN/gstack-config get question_tuning 2>/dev/null || echo "false")
echo "QUESTION_TUNING: $_QUESTION_TUNING"
mkdir -p ~/.gstack/analytics
if [ "$_TEL" != "off" ]; then
echo '{"skill":"plan-eng-review","ts":"'$(date -u +%Y-%m-%dT%H:%M:%SZ)'","repo":"'$(basename "$(git rev-parse --show-toplevel 2>/dev/null)" 2>/dev/null || echo "unknown")'"}'  >> ~/.gstack/analytics/skill-usage.jsonl 2>/dev/null || true
fi
# zsh-compatible: use find instead of glob to avoid NOMATCH error
for _PF in $(find ~/.gstack/analytics -maxdepth 1 -name '.pending-*' 2>/dev/null); do
  if [ -f "$_PF" ]; then
    if [ "$_TEL" != "off" ] && [ -x "$GSTACK_BIN/gstack-telemetry-log" ]; then
      $GSTACK_BIN/gstack-telemetry-log --event-type skill_run --skill _pending_finalize --outcome unknown --session-id "$_SESSION_ID" 2>/dev/null || true
    fi
    rm -f "$_PF" 2>/dev/null || true
  fi
  break
done
# Learnings count
eval "$($GSTACK_BIN/gstack-slug 2>/dev/null)" 2>/dev/null || true
_LEARN_FILE="${GSTACK_HOME:-$HOME/.gstack}/projects/${SLUG:-unknown}/learnings.jsonl"
if [ -f "$_LEARN_FILE" ]; then
  _LEARN_COUNT=$(wc -l < "$_LEARN_FILE" 2>/dev/null | tr -d ' ')
  echo "LEARNINGS: $_LEARN_COUNT entries loaded"
  if [ "$_LEARN_COUNT" -gt 5 ] 2>/dev/null; then
    $GSTACK_BIN/gstack-learnings-search --limit 3 2>/dev/null || true
  fi
else
  echo "LEARNINGS: 0"
fi
# Session timeline: record skill start (local-only, never sent anywhere)
$GSTACK_BIN/gstack-timeline-log '{"skill":"plan-eng-review","event":"started","branch":"'"$_BRANCH"'","session":"'"$_SESSION_ID"'"}' 2>/dev/null &
# Check if CLAUDE.md has routing rules
_HAS_ROUTING="no"
if [ -f CLAUDE.md ] && grep -q "## Skill routing" CLAUDE.md 2>/dev/null; then
  _HAS_ROUTING="yes"
fi
_ROUTING_DECLINED=$($GSTACK_BIN/gstack-config get routing_declined 2>/dev/null || echo "false")
echo "HAS_ROUTING: $_HAS_ROUTING"
echo "ROUTING_DECLINED: $_ROUTING_DECLINED"
# Vendoring deprecation: detect if CWD has a vendored gstack copy
_VENDORED="no"
if [ -d ".agents/skills/gstack" ] && [ ! -L ".agents/skills/gstack" ]; then
  if [ -f ".agents/skills/gstack/VERSION" ] || [ -d ".agents/skills/gstack/.git" ]; then
    _VENDORED="yes"
  fi
fi
echo "VENDORED_GSTACK: $_VENDORED"
echo "MODEL_OVERLAY: claude"
# Checkpoint mode (explicit = no auto-commit, continuous = WIP commits as you go)
_CHECKPOINT_MODE=$($GSTACK_BIN/gstack-config get checkpoint_mode 2>/dev/null || echo "explicit")
_CHECKPOINT_PUSH=$($GSTACK_BIN/gstack-config get checkpoint_push 2>/dev/null || echo "false")
echo "CHECKPOINT_MODE: $_CHECKPOINT_MODE"
echo "CHECKPOINT_PUSH: $_CHECKPOINT_PUSH"
# Detect spawned session (OpenClaw or other orchestrator)
[ -n "$OPENCLAW_SESSION" ] && echo "SPAWNED_SESSION: true" || true
```

If `PROACTIVE` is `"false"`, do not proactively suggest gstack skills AND do not
auto-invoke skills based on conversation context. Only run skills the user explicitly
types (e.g., /qa, /ship). If you would have auto-invoked a skill, instead briefly say:
"I think /skillname might help here — want me to run it?" and wait for confirmation.
The user opted out of proactive behavior.

If `SKILL_PREFIX` is `"true"`, the user has namespaced skill names. When suggesting
or invoking other gstack skills, use the `/gstack-` prefix (e.g., `/gstack-qa` instead
of `/qa`, `/gstack-ship` instead of `/ship`). Disk paths are unaffected — always use
`$GSTACK_ROOT/[skill-name]/SKILL.md` for reading skill files.

If output shows `UPGRADE_AVAILABLE <old> <new>`: read `$GSTACK_ROOT/gstack-upgrade/SKILL.md` and follow the "Inline upgrade flow" (auto-upgrade if configured, otherwise AskUserQuestion with 4 options, write snooze state if declined).

If output shows `JUST_UPGRADED <from> <to>` AND `SPAWNED_SESSION` is NOT set: tell
the user "Running gstack v{to} (just updated!)" and then check for new features to
surface. For each per-feature marker below, if the marker file is missing AND the
feature is plausibly useful for this user, use AskUserQuestion to let them try it.
Fire once per feature per user, NOT once per upgrade.

**In spawned sessions (`SPAWNED_SESSION` = "true"): SKIP feature discovery entirely.**
Just print "Running gstack v{to}" and continue. Orchestrators do not want interactive
prompts from sub-sessions.

**Feature discovery markers and prompts** (one at a time, max one per session):

1. `$GSTACK_ROOT/.feature-prompted-continuous-checkpoint` →
   Prompt: "Continuous checkpoint auto-commits your work as you go with `WIP:` prefix
   so you never lose progress to a crash. Local-only by default — doesn't push
   anywhere unless you turn that on. Want to try it?"
   Options: A) Enable continuous mode, B) Show me first (print the section from
   the preamble Continuous Checkpoint Mode), C) Skip.
   If A: run `$GSTACK_BIN/gstack-config set checkpoint_mode continuous`.
   Always: `touch $GSTACK_ROOT/.feature-prompted-continuous-checkpoint`

2. `$GSTACK_ROOT/.feature-prompted-model-overlay` →
   Inform only (no prompt): "Model overlays are active. `MODEL_OVERLAY: {model}`
   shown in the preamble output tells you which behavioral patch is applied.
   Override with `--model` when regenerating skills (e.g., `bun run gen:skill-docs
   --model gpt-5.4`). Default is claude."
   Always: `touch $GSTACK_ROOT/.feature-prompted-model-overlay`

After handling JUST_UPGRADED (prompts done or skipped), continue with the skill
workflow.

If `WRITING_STYLE_PENDING` is `yes`: You're on the first skill run after upgrading
to gstack v1. Ask the user once about the new default writing style. Use AskUserQuestion:

> v1 prompts = simpler. Technical terms get a one-sentence gloss on first use,
> questions are framed in outcome terms, sentences are shorter.
>
> Keep the new default, or prefer the older tighter prose?

Options:
- A) Keep the new default (recommended — good writing helps everyone)
- B) Restore V0 prose — set `explain_level: terse`

If A: leave `explain_level` unset (defaults to `default`).
If B: run `$GSTACK_BIN/gstack-config set explain_level terse`.

Always run (regardless of choice):
```bash
rm -f ~/.gstack/.writing-style-prompt-pending
touch ~/.gstack/.writing-style-prompted
```

This only happens once. If `WRITING_STYLE_PENDING` is `no`, skip this entirely.

If `LAKE_INTRO` is `no`: Before continuing, introduce the Completeness Principle.
Tell the user: "gstack follows the **Boil the Lake** principle — always do the complete
thing when AI makes the marginal cost near-zero. Read more: https://garryslist.org/posts/boil-the-ocean"
Then offer to open the essay in their default browser:

```bash
open https://garryslist.org/posts/boil-the-ocean
touch ~/.gstack/.completeness-intro-seen
```

Only run `open` if the user says yes. Always run `touch` to mark as seen. This only happens once.

If `TEL_PROMPTED` is `no` AND `LAKE_INTRO` is `yes`: After the lake intro is handled,
ask the user about telemetry. Use AskUserQuestion:

> Help gstack get better! Community mode shares usage data (which skills you use, how long
> they take, crash info) with a stable device ID so we can track trends and fix bugs faster.
> No code, file paths, or repo names are ever sent.
> Change anytime with `gstack-config set telemetry off`.

Options:
- A) Help gstack get better! (recommended)
- B) No thanks

If A: run `$GSTACK_BIN/gstack-config set telemetry community`

If B: ask a follow-up AskUserQuestion:

> How about anonymous mode? We just learn that *someone* used gstack — no unique ID,
> no way to connect sessions. Just a counter that helps us know if anyone's out there.

Options:
- A) Sure, anonymous is fine
- B) No thanks, fully off

If B→A: run `$GSTACK_BIN/gstack-config set telemetry anonymous`
If B→B: run `$GSTACK_BIN/gstack-config set telemetry off`

Always run:
```bash
touch ~/.gstack/.telemetry-prompted
```


exec
/bin/bash -lc "nl -ba .omc/specs/deep-interview-luke-scribe-stt-api.md | sed -n '1,260p'" in /root/luke_scribe
exec
/bin/bash -lc "nl -ba .omc/plans/consensus-luke-scribe-stt-api.md | sed -n '1,320p'" in /root/luke_scribe
 succeeded in 0ms:
     1	# Deep Interview Spec: luke_scribe — 로컬 STT 전사 API 시스템
     2
     3	> 내부용(비공개) 음성/영상 → 텍스트 전사 API. 로컬 모델 실행, GPU/CPU 자동·수동 선택,
     4	> 실시간(WebSocket) + 배치(파일/영상), 작업 큐·진행률, 혼용어 대응, 후처리, Colab 자동 노출.
     5
     6	## Metadata
     7	- Interview ID: `di-luke-scribe-stt-20260602`
     8	- Rounds: 3 (스코어링) + 추가 아이디어 1 + 열린 결정 확정 1
     9	- Final Ambiguity Score: **~10%** (threshold 20%; 열린 결정 6건 확정 후)
    10	- Type: **greenfield** (빈 저장소 `luke_scribe`)
    11	- Generated: 2026-06-02
    12	- Threshold: 0.2 / Threshold Source: `default`
    13	- Initial Context Summarized: no
    14	- Status: **PASSED · 결정 확정 완료**
    15
    16	## Clarity Breakdown
    17	| Dimension | Score | Weight | Weighted |
    18	|-----------|-------|--------|----------|
    19	| Goal Clarity | 0.94 | 0.40 | 0.376 |
    20	| Constraint Clarity | 0.90 | 0.30 | 0.270 |
    21	| Success Criteria | 0.86 | 0.30 | 0.258 |
    22	| **Total Clarity** | | | **0.904** |
    23	| **Ambiguity** | | | **0.096 (~10%)** |
    24
    25	---
    26
    27	## Topology (확정 컴포넌트)
    28
    29	| # | Component | Status | 설명 | 커버리지 |
    30	|---|-----------|--------|------|----------|
    31	| 1 | **Ingestion API** | active | 실시간 스트림(WebSocket) + 파일/영상 업로드 수집 | AC-1, AC-7, AC-9 |
    32	| 2 | **Transcription Engine** | active | 로컬 STT(faster-whisper), **하이브리드: 실시간=turbo / 배치=large-v3** | AC-4 |
    33	| 3 | **Realtime Pipeline** | active | VAD·청크·부분/최종 결과 스트리밍 | AC-8 |
    34	| 4 | **Output / Results** | active | 요청별 출력옵션(txt/ts/word/diarize/SRT/VTT), 결과 보관(7일) | AC-9, AC-11 |
    35	| 5 | **Job Queue / Concurrency** (1급) | active | Job 추상화, **Redis 영속 큐**, 워커풀, 우선순위 레인, queue_position·진행률 | AC-5, AC-6 |
    36	| 6 | **Device Manager** (횡단) | active | GPU/CPU 자동감지 → 정밀도·워커수·동시성 자동 산정, 강제 플래그 | AC-2, AC-3 |
    37	| 7 | **Post-processing** | active | glossary/rules + (옵션)LLM 보정(백엔드 설정화) + confidence 플래그 | AC-10 |
    38	| 8 | **Connectivity / Tunnel** | active | Colab 등 공인 IP 없는 환경 자동 외부 노출(cloudflared 등) | AC-13 |
    39
    40	---
    41
    42	## Goal
    43
    44	**내부 서비스가 호출하는 비공개 API로, 실시간 음성·녹음 파일·mp3·mp4(및 기타 영상)를 입력받아 로컬에서 실행되는 STT 모델로 텍스트로 전사한다.** 실시간 입력은 준실시간(3~5초 내 부분 결과)으로 전사한다. 모델은 감지된 하드웨어(GPU/CPU)에 맞춰 정밀도·동시성을 자동 결정하되 `auto | cpu | cuda` 강제 선택도 가능하다. 다수 작업을 동시/대기열로 처리하고, 호출자는 대기열 위치와 진행률을 조회할 수 있다. 한국어 중심이되 한·영 혼용 기술용어(예: "API", "vLLM")를 음차로 망가뜨리지 않고 정확히 전사한다. **정확도가 중요한 배치는 large-v3, 저지연이 중요한 실시간은 turbo**로 분리한다(하이브리드).
    45
    46	---
    47
    48	## Constraints (제약)
    49
    50	- **로컬 실행 전용(STT).** 외부 STT API 의존 금지. 모델 가중치는 로컬/오프라인 동작.
    51	- **하드웨어 폭이 매우 넓음:** 개발=GTX 1050(Pascal, 2~4GB), 테스트=Colab/T4/L4/A100/H100. → 고정 수치 설정 불가, **자동 산정 필수**.
    52	- **정밀도 자동 선택:** compute capability ≥ 7.0 → fp16, Pascal(6.x) → int8, VRAM 부족 → CPU 폴백, CPU → int8.
    53	- **동시성/워커 수는 감지된 VRAM·코어로 자동 산정**(오버라이드만 허용).
    54	- **모델 하이브리드:** 실시간=turbo, 배치=large-v3 (둘 다 설치, `model` 오버라이드 가능).
    55	- **언어:** 한국어 우선 + 자동 감지, 한·영 혼용(code-switching) 정확도가 하드 요구.
    56	- **실시간 전송:** WebSocket. 목표 지연 3~5초(관대) → 정확도 우선 청킹 가능.
    57	- **인증:** API Key 헤더(내부용).
    58	- **큐:** **Redis 영속 큐**(RQ/Celery) — 재시작 내성·다중 워커. 개발/단발용 in-process 폴백 제공.
    59	- **보관:** 결과/메타만 **7일** 보관(설정화·자동삭제), **업로드 원본 오디오는 처리 후 즉시 삭제**.
    60	- **파일 상한:** 모든 입력 **비동기 Job 기본**, 절대 상한 **4시간 / 2GB**(초과 `413`, 설정화).
    61	- **배포 이원화:** CLI(셸 스크립트)=개발·테스트·Colab / Docker(FastAPI/Python)=프로덕션(내부).
    62	- **CPU 폴백은 항상 지원.**
    63
    64	## Non-Goals (명시적 비범위)
    65
    66	- 외부 공개(public) API·과금·멀티테넌시 SaaS 기능.
    67	- 자체 STT 모델 학습/파인튜닝(기성 Whisper 계열 사용).
    68	- 번역(translation) — 1차 범위 외.
    69	- 프런트엔드 UI(API/CLI만 제공).
    70	- 영구 원본 오디오 아카이빙(원본은 삭제가 기본).
    71
    72	---
    73
    74	## Acceptance Criteria (검증 가능 기준)
    75
    76	- [ ] **AC-1** 동일 시스템으로 파일(오디오/영상)·실시간(WebSocket) 입력을 모두 전사한다.
    77	- [ ] **AC-2** `device=auto`가 GTX 1050에서 int8/CPU로, T4/L4/A100/H100에서 fp16로 자동 동작하고, `cpu`/`cuda[:n]` 강제 플래그가 동작한다.
    78	- [ ] **AC-3** 정밀도·워커 수가 감지된 VRAM/compute capability로 자동 산정되며 `--workers`/`--compute-type` 오버라이드가 가능하다.
    79	- [ ] **AC-4** 혼용어 검증: *"그 API 서빙할 때 vLLM 쓰면 성능 대박이야"* 입력 시 "API", "vLLM"이 영문 그대로(핫워드 적용 시) 전사된다. 배치 경로(large-v3)에서 정확도가 더 높음을 확인한다.
    80	- [ ] **AC-5** 동시 다중 작업을 받아 Redis 큐에 적재/동시 처리하며, 작업 중에도 신규 입력을 계속 수신한다.
    81	- [ ] **AC-6** 호출자가 `queue_position`(앞 N건)과 `progress`(처리된 길이/전체, %)를 조회할 수 있다.
    82	- [ ] **AC-7** 장시간/대용량 파일이 VAD 세그먼트로 분할되어 진행률을 제공하고 메모리 사용이 일정하다. 4h/2GB 초과는 `413`.
    83	- [ ] **AC-8** 실시간 부분 결과가 3~5초 내 스트리밍되고 최종 결과로 안정화된다(turbo 경로).
    84	- [ ] **AC-9** 영상 파일이 ffmpeg로 오디오 추출 후 전사되고, 출력 옵션(timestamps/word/diarize/formats)이 요청별로 동작한다.
    85	- [ ] **AC-10** 후처리: glossary/rules가 동작하고, LLM 보정(백엔드 `local`/`external` 설정화, 기본 off·신뢰도 게이팅)과 저신뢰 구간 플래그가 동작한다.
    86	- [ ] **AC-11** API Key 인증이 적용되고, 전사 완료 후 원본 오디오가 삭제되며 결과만 7일 보관된다.
    87	- [ ] **AC-12** CLI(`serve`/`transcribe`/`bench`/`detect`)와 Docker(GPU/CPU 이미지 + Redis)로 각각 실행된다.
    88	- [ ] **AC-13** Colab에서 `--tunnel cloudflare`로 공개 URL이 자동 발급되어 외부에서 호출된다.
    89
    90	---
    91
    92	## Architecture (상세 설계)
    93
    94	### 시스템 개요도
    95
    96	```
    97	                         ┌───────────────────────────────────────────────┐
    98	  내부 호출자             │                luke_scribe API                 │
    99	  (서비스/CLI)            │                                                │
   100	        │                │  ┌──────────────┐     ┌────────────────────┐  │
   101	  REST  ├──── 파일/영상 ─▶│  │ Ingestion API│────▶│  Job Queue (Redis)  │ │
   102	  (HTTP)│                │  │ (FastAPI)    │     │  - priority lanes   │  │
   103	        │                │  │  - upload    │     │   (realtime/batch)  │  │
   104	  WS    ├── 실시간 오디오▶│  │  - WS stream │     │  - queue_position   │  │
   105	        │                │  │  - auth(API  │     │  - progress         │  │
   106	        │                │  │    Key)      │     │  - durable/재시작내성 │ │
   107	        │                │  └──────┬───────┘     └─────────┬──────────┘  │
   108	        │                │         │ ffmpeg(영상→오디오)    │ dispatch     │
   109	        │                │         ▼                        ▼             │
   110	        │                │  ┌──────────────┐     ┌────────────────────┐  │
   111	        │                │  │ Realtime     │     │  Worker Pool        │  │
   112	        │                │  │ Pipeline     │     │  (N = 자동산정)      │  │
   113	        │                │  │ VAD→chunk→   │◀───▶│  ┌───────────────┐  │  │
   114	        │                │  │ partial/final│     │  │ Engine        │  │  │
   115	        │                │  │ (turbo)      │     │  │ faster-whisper│  │  │
   116	        │                │  └──────┬───────┘     │  │ rt=turbo      │  │  │
   117	        │                │         │             │  │ batch=large-v3│  │  │
   118	        │                │         ▼             │  └──────┬────────┘  │  │
   119	        │                │  ┌──────────────┐     │         │           │  │
   120	        │                │  │Post-processing│◀───┤         │ uses       │  │
   121	        │                │  │glossary/rules │    └─────────┼───────────┘  │
   122	        │                │  │+LLM(opt,plug) │              │              │
   123	        │                │  │+conf flag     │    ┌─────────▼──────────┐  │
   124	        │                │  └──────┬───────┘     │  Device Manager     │  │
   125	        │                │         ▼             │  GPU/CPU 감지 →      │  │
   126	        │◀── 결과/진행률 ─┤  ┌──────────────┐    │  fp16·int8·CPU /     │  │
   127	        │   (txt/srt/    │  │Output/Results│    │  worker수·동시성     │  │
   128	        │    vtt/json)   │  │store(7일,결과)│    └────────────────────┘  │
   129	        │                │  └──────────────┘                              │
   130	        │                │  Connectivity/Tunnel (Colab→cloudflared 자동)  │
   131	        │                └───────────────────────────────────────────────┘
   132	```
   133
   134	### 1) Ingestion API (입력/수집)
   135
   136	**REST (배치/파일):**
   137	| Method | Path | 설명 |
   138	|--------|------|------|
   139	| `POST` | `/v1/jobs` | multipart: `file`(오디오/영상) + `options`(JSON). → `{job_id, status:"queued", queue_position}` |
   140	| `GET` | `/v1/jobs/{id}` | 상태 조회: `queued`(queue_position, jobs_ahead) / `processing`(progress %, processed_sec/total_sec, eta) / `completed` / `failed`(error) |
   141	| `GET` | `/v1/jobs/{id}/result?format=txt\|srt\|vtt\|json` | 결과 조회(포맷 변환) |
   142	| `DELETE` | `/v1/jobs/{id}` | 작업 취소 |
   143	| `GET` | `/v1/jobs` | 작업 목록/필터 |
   144
   145	**WebSocket (실시간):** `WS /v1/stream`
   146	- 1) 핸드셰이크 메시지(JSON): `{options:{language, hotwords, ...}}`
   147	- 2) 클라이언트 → 오디오 청크(PCM16/opus 등) 연속 전송
   148	- 3) 서버 → `{type:"partial", text, t0,t1}`(가설) / `{type:"final", segment, start, end, words[]}`(확정) / `{type:"status", ...}`
   149
   150	**Admin/관측:** `GET /health`, `GET /v1/system`(device 프로파일·워커수·큐 깊이), `GET /v1/models`.
   151
   152	**인증:** `X-API-Key: <key>`(또는 `Authorization: Bearer`). 키별 사용량/권한 확장 여지.
   153
   154	**요청 옵션 스키마(`options`):**
   155	```jsonc
   156	{
   157	  "language": "auto",                   // "ko" | "auto" | ...
   158	  "model": null,                         // null=경로별 기본(rt=turbo, batch=large-v3). 오버라이드 가능
   159	  "device": "auto",                      // "auto" | "cpu" | "cuda" | "cuda:0"
   160	  "compute_type": null,                  // null=자동. "float16"|"int8"|"int8_float16"
   161	  "timestamps": true,                    // 세그먼트 타임스탬프
   162	  "word_timestamps": false,              // 단어 단위
   163	  "diarize": false,                      // 화자 분리(pyannote, opt, HF 토큰)
   164	  "formats": ["json"],                   // ["txt","srt","vtt","json"]
   165	  "hotwords": ["vLLM","FastAPI","Kubernetes","LLM","GPU"],
   166	  "glossary_id": null,                   // 저장된 도메인 사전 참조
   167	  "vad": true,                           // 무음 제거
   168	  "post_correction": {                   // 단계 제어
   169	    "mode": "rules",                     // "none"|"glossary"|"rules"|"llm"
   170	    "backend": "local",                  // llm 모드 시: "local"|"openai"|"external"
   171	    "corrector_model": null              // 백엔드별 모델/엔드포인트
   172	  }
   173	}
   174	```
   175
   176	### 2) Transcription Engine (전사 엔진)
   177
   178	- **런타임:** **faster-whisper (CTranslate2)** — openai-whisper 대비 ~4배 빠르고 메모리 적음, GPU/CPU·fp16/int8 지원, Silero VAD 내장, 배치 추론 지원.
   179	- **모델 전략(확정): 하이브리드**
   180	  - **실시간 경로 = large-v3-turbo** (저지연; 디코더 4층 경량화).
   181	  - **배치 경로 = large-v3** (혼용어/다국어 정확도 우위).
   182	  - 두 모델 모두 설치, 경로별 기본값 적용. `model` 옵션/환경변수로 런타임 오버라이드.
   183	- **혼용어 대응(핵심):**
   184	  1. `hotwords`/`initial_prompt`에 도메인 용어 주입 → 기술용어 음차화 방지.
   185	  2. 저장 가능한 **Glossary**(도메인 사전) → `glossary_id`로 재사용.
   186	  3. (옵션) 후처리 LLM 보정으로 잔여 오류 교정.
   187	- **제외:** distil-whisper, NVIDIA Parakeet/Canary(영어 중심 → 한국어 혼용 부적합).
   188
   189	> ✅ **결정 근거:** 다국어/혼용어 정확도는 **large-v3가 turbo보다 우위**(turbo는 일부 언어 정확도 하락). 따라서 정확도 중요한 배치는 large-v3, 저지연 중요한 실시간은 turbo로 분리(하이브리드 확정). 추후 도메인 샘플 WER 벤치로 실시간 경로의 v3 승격 여부 재평가 가능.
   190
   191	### 3) Realtime Pipeline (실시간)
   192
   193	- WebSocket 오디오 프레임 → 링버퍼 → **Silero VAD**로 발화 구간 검출 → 청크 구성 → 전사 → 부분/최종 방출.
   194	- **안정화 정책:** LocalAgreement(연속 가설 일치분 확정) 또는 AlignAtt(2025 SOTA). 지연이 3~5초로 관대하므로 **큰 청크 + LocalAgreement-2**로 정확도 우선. 실시간 경로 기본 모델은 **turbo**.
   195	- **참고 구현:** [WhisperLive](https://github.com/collabora/WhisperLive)(faster-whisper 백엔드, WS, VAD), [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit)(AlignAtt), [whisper_streaming](https://github.com/ufal/whisper_streaming)(→ SimulStreaming 대체 추세). 정책 채택/재구현 모두 가능.
   196
   197	### 4) Output / Results (출력·보관)
   198
   199	- 텍스트 기본 + 요청 옵션별 타임스탬프/단어/화자/자막.
   200	- 포맷 변환: `json`(원천) → `txt`/`srt`/`vtt`/structured-`json`.
   201	- **보관(확정):** 결과/메타만 **7일** 보관(설정화·만료 자동삭제), **원본 오디오는 전사 직후 삭제**.
   202	- 저장소: 기본 로컬 파일/SQLite, 확장 시 S3/DB 가능.
   203
   204	### 5) Job Queue / Concurrency (큐·동시성)
   205
   206	- **Job 추상화:** 파일·실시간·영상 모두 Job으로 통일. 작업 중에도 신규 Job 계속 수신.
   207	- **우선순위 레인:** 실시간 세션=저지연 우선 / 배치=처리량 레인.
   208	- **워커풀:** 워커 수 = Device Manager 자동 산정. 각 워커가 디바이스 바인딩된 모델 인스턴스 보유.
   209	- **큐 백엔드(확정):** **Redis + RQ/Celery** 영속 큐를 처음부터 사용 → 재시작 내성·다중 워커 프로세스 지원. (개발/단발용 in-process 폴백 모드 제공.)
   210	- **진행률:** 장시간 파일은 VAD 세그먼트 분할 → `progress = 완료 세그먼트 / 전체`(또는 처리 오디오초/전체초). `queue_position` = 큐 인덱스.
   211	- **백프레셔:** 최대 큐 길이 초과 시 `429`.
   212
   213	### 6) Device Manager (자동 하드웨어 산정 — 설계 중심축)
   214
   215	**감지:** CUDA 가용 여부, device name, VRAM total/free, compute capability, CPU 코어/RAM.
   216
   217	**자동 결정 매트릭스:**
   218	| 감지 결과 | compute capability | 정밀도 | 워커/동시성 |
   219	|---|---|---|---|
   220	| H100 / A100 / L4 / T4 | 7.0+ (Tensor Core) | **float16**(또는 int8_float16) | `floor(free_VRAM / 워커당_추정)` 다중 워커 |
   221	| **GTX 1050** | 6.1 (Pascal, fp16 비효율) | **int8**, 부족 시 **CPU 폴백** | 1 워커 또는 CPU |
   222	| CUDA 없음 | — | **CPU int8** | CPU 코어 기반(1~2) |
   223
   224	- **모델 적재 검사:** `model_fp16 > free_VRAM` → int8 시도 → 그래도 부족 → CPU 폴백(또는 작은 모델 + 경고).
   225	- **참고치:** turbo ≈ 809M(fp16 ~3GB / int8 ~1.5GB), large-v3 ≈ 1.55B(fp16 ~6GB / int8 ~3GB). T4=16GB·L4=24GB·A100=40/80GB·H100=80GB → fp16 다중 워커. GTX 1050 2GB → int8 빠듯 → 실질 int8 1워커 또는 CPU.
   226	- **오버라이드:** `--device auto|cpu|cuda:N`, `--compute-type`, `--model`, `--workers`.
   227	- **CLI `detect`**로 프로파일·권장 설정 출력.
   228
   229	### 7) Post-processing (전사 오류 후처리)
   230
   231	순차 파이프라인(요청별 `post_correction.mode`로 단계 제어):
   232	1. **Glossary/Hotwords (예방, 항상 권장):** 디코드 시 도메인 용어 바이어스(가장 비용 효율적).
   233	2. **Rule/Dictionary 정규화(deterministic):** 알려진 오인식 → 표준 용어 치환, 정규식, 약어 대소문자 보정.
   234	3. **LLM 보정(확정: 백엔드 설정화, 기본 off·confidence-gated):** 저신뢰/고WER 구간만 교정(Judge-Editor: 고신뢰 스팬 유지, 불확실 스팬만 재작성). **백엔드 플러그형** — `local`(소형 LLM, 오프라인·프라이버시) 또는 `openai`/`external`(OpenAI 호환 엔드포인트), `corrector_model` 설정 가능. 기본 비활성(약 HW 보호·과교정 방지).
   235	4. **Confidence 플래깅:** 세그먼트별 신뢰도 부여, 저신뢰 구간 표시 → 선택적 휴먼 리뷰.
   236
   237	> ⚠️ **리서치 근거:** LLM 후처리는 **입력 WER이 높을 때(>10%)** WER을 크게 낮추지만, 이미 정확한 전사에는 **paraphrastic drift(과교정)** 위험 → 신뢰도 게이팅 필수. 도메인 고유명사/기술용어 손상이 핵심 위험(R-WER/EWER로 측정).
   238	> 🔒 **프라이버시:** `external`/`openai` 백엔드는 전사 텍스트를 외부로 전송하므로 내부 전용 정책과 상충 가능 → **기본은 `local`**, 외부 백엔드는 명시적 opt-in.
   239
   240	### 8) Connectivity / Tunnel (Colab 자동 외부 노출)
   241
   242	- **환경 자동 감지**(Colab/Kaggle/dev) → 옵션 시 터널 기동.
   243	- **기본: cloudflared Quick Tunnel** — `https://<random>.trycloudflare.com`, **계정/도메인 불필요**, 임시 URL, 제로 설정. (`--tunnel cloudflare`)
   244	- **대안: ngrok** — authtoken 필요, 무료는 재시작 시 URL 변경, 요청 인스펙션 제공. (`--tunnel ngrok --ngrok-token ...`)
   245	- **안정 도메인:** named Cloudflare Tunnel(CF 계정+도메인 필요).
   246	- **프로덕션/실IP:** `--tunnel none`, host IP 바인딩.
   247	- 기동 시 공개 URL + API Key 출력.
   248
   249	### 배포 (Deployment)
   250
   251	- **CLI (`run.sh` + `cli.py`)** — 개발·테스트·Colab.
   252	  - `serve`(API + 옵션 터널 기동) / `transcribe <file>`(단발) / `bench`(디바이스·모델 벤치) / `detect`(프로파일 출력)
   253	- **Docker (FastAPI + uvicorn/gunicorn)** — 프로덕션(내부).
   254	  - GPU 이미지(`nvidia/cuda` 베이스) + CPU 이미지(slim) 분리. `docker-compose`: API + **Redis**(영속 큐) + 워커, (옵션) LLM 보정 서비스.
   255	- **설정:** env + `.env`/yaml — model(rt/batch), device, workers, api_keys, retention_days, tunnel, redis_url, corrector.
   256
   257	### 기술 스택 (제안)
   258
   259	Python 3.11+, **FastAPI** + uvicorn, **faster-whisper(CTranslate2)** (turbo + large-v3), **ffmpeg**(영상→16kHz mono), **Silero VAD**(faster-whisper 내장), **Redis + RQ/Celery**(영속 큐), pydantic v2, **LLM 보정 백엔드**(local: llama.cpp/transformers · external: OpenAI 호환 client), (옵션) **pyannote.audio**(diarization, HF 토큰), **cloudflared**/pyngrok(터널), loguru/structlog(로깅), prometheus-client(메트릭, 옵션).
   260

 succeeded in 1ms:
     1	# Consensus Implementation Plan: luke_scribe — 로컬 STT 전사 API
     2
     3	- **Status:** `pending approval` (consensus **v2.1** — iteration 2 합의 도달; Architect+Critic 모두 APPROVE WITH CONDITIONS, 잔여조건 반영 완료)
     4	- **Mode:** `--consensus --direct --deliberate`
     5	- **Source spec:** `.omc/specs/deep-interview-luke-scribe-stt-api.md` (ambiguity ~10%, PASSED)
     6	- **Project:** greenfield `/root/luke_scribe`
     7	- **Generated:** 2026-06-02 · **Revised:** 2026-06-02 (v2)
     8
     9	> **v2 changelog는 문서 맨 끝 §13 참조.** v1 대비 P0 3건(워커 실행모델, VRAM 측정/단계역전, 취소·임시파일·보관경합), P1 4건, P2 3건, 모호 AC 4건을 반영.
    10
    11	---
    12
    13	## 1. Requirements Summary
    14
    15	내부용(비공개) 로컬 STT 전사 API. 단일 `Job` 추상화로 **배치(파일/영상)** 와 **실시간(WebSocket)** 입력 처리. faster-whisper(CTranslate2) 런타임, **하이브리드 모델**(실시간=turbo, 배치=large-v3) — 단 **P1 bench 게이트로 검증**. **Device Manager** 가 GPU/CPU를 감지하고 **부팅 시 VRAM 실측**으로 정밀도·워커수를 산정(1050~H100), `auto|cpu|cuda` 강제 가능. **Redis(RQ) 영속 큐**(배치) + **전용 실시간 핸들러**(WS), `queue_position`·`progress` 보고. 요청별 출력옵션(txt/ts/word/diarize/SRT/VTT). 후처리(glossary→rules→LLM(local/external, 기본 off·신뢰도 게이팅)→confidence). API Key 인증(+스코프), 결과 7일 보관·**파생 오디오 포함 즉시 삭제**, 4h/2GB → `413`, 큐 만재 → `429`. 배포: CLI(dev/Colab)+Docker(prod)+Colab cloudflared. diarization 옵션(pyannote).
    16
    17	---
    18
    19	## 2. RALPLAN-DR Summary
    20
    21	### Principles
    22	1. **Hardware-adaptive, fail-explicit** — 1050~H100 자동 감지·정밀도/동시성 산정. 적재 불가 시 **모델/정밀도 강등 → CPU**로 우아하게 내려가되, 강등이 불가능하면 **명확한 오류로 거부**(조용한 OOM·무한 강등 금지). "never fail"이 아니라 "never fail silently".
    23	2. **One Job abstraction, two execution lanes** — 모든 입력을 Job 수명주기(queued→processing→completed/failed/cancelled)로 통일하되, **배치=RQ 워커 / 실시간=장수명 WS 핸들러**로 실행 레인을 분리(WS는 enqueue-once가 아니므로).
    24	3. **Accuracy/latency 분리(검증 기반)** — batch=large-v3, realtime=turbo. 단 **하이브리드 채택은 P1 bench의 측정 델타로 게이트**(불충분하면 단일 모델로 단순화).
    25	4. **Privacy-first, enforced** — 원본+파생 오디오 즉시 삭제(모든 종료 경로 `finally`), 결과 7d TTL, 외부 LLM egress는 **allowlist+opt-in+감사로그** 없이는 금지.
    26	5. **Dev/prod parity** — 동일 코어, CLI(dev/Colab)/Docker(prod)는 설정 차이. 큐는 prod=RQ / dev=in-proc 폴백이되 **동일 Job 인터페이스 뒤**에 두어 의미 동등성 유지.
    27
    28	### Decision Drivers (top 3)
    29	1. **혼용어(KO+EN) 정확도** — hotwords + 모델 선택 + 후처리.
    30	2. **하드웨어 이식성 + 자동 스케일(실측 기반)**.
    31	3. **동시성 + 가시성(queue_position/progress)**.
    32
    33	### Viable Options
    34
    35	**D1. 큐/동시성 백엔드**
    36	- **(A) Redis + RQ `SimpleWorker`(no-fork) + 장수명 모델 보유 프로세스** ✅ *(채택, 사용자 Redis 확정 + fork 문제 회피)* — Pros: 영속·재시작 내성, 모델 1회 적재 후 재사용, CUDA-fork 충돌 회피. Cons: SimpleWorker는 작업 중 하트비트 없음 → progress emit로 보완 필요.
    37	- **(B) Redis + Celery** — Pros: 라우팅/우선순위/재시도 성숙. Cons: GPU 단일박스엔 과함. *Invalidation:* RQ+SimpleWorker로 충분.
    38	- **(C) in-process asyncio + GPU 세마포어** — dev/단발 폴백 + **P2 컨틴전시**(RQ/CUDA가 막히면 동일 Job 인터페이스로 폴백).
    39	> ⚠️ **근거:** RQ 기본 워커는 작업당 `os.fork()`하며, 부모에서 초기화된 CUDA 컨텍스트는 fork된 자식에서 재사용 불가([pytorch#40403](https://github.com/pytorch/pytorch/issues/40403)). 따라서 **fork 금지(SimpleWorker/장수명)** 가 필수. RQ progress는 내장이 없어 `job.save_meta()` 수동 호출 필요([RQ docs](https://python-rq.org/docs/jobs/)).
    40
    41	**D2. 실시간 스트리밍 구현**
    42	- **(A) faster-whisper 위 커스텀 LocalAgreement-2** ✅ *(채택)* — Pros: 큐/디바이스/후처리 통합 제어. Cons: 안정화 정확성 직접 구현(난이도 과소평가 금지 → §3.7c 계약 명시). 지연 3~5초 관대는 *지연*만 완화하지 *정확성*은 아님.
    43	- **(B) WhisperLive 백엔드 vendoring** — 검증된 WS+VAD를 backend로 감싸기. *Invalidation:* 통합·하이브리드 제어가 우선이나 **청킹/안정화 휴리스틱은 차용**.
    44	- **(C) WhisperLiveKit(AlignAtt/SimulStreaming)** — 2025 SOTA. *Invalidation:* 3~5초 목표엔 과투자(P5 옵션).
    45
    46	**D3. 추론 백엔드 추상화**
    47	- **(A) faster-whisper 단일 엔진 + compute_type 자동** ✅ — Pros: GPU/CPU/int8/fp16 단일 경로. Cons: 타 엔진 미지원(범위상 불필요).
    48	- **(B) 멀티 엔진 플러그인** — *Invalidation:* 조기 추상화 → 얇은 인터페이스(`engine/base.py`)만 두고 구현 1종.
    49
    50	---
    51
    52	## 3. Target Project Structure
    53
    54	```
    55	luke_scribe/
    56	├── pyproject.toml (uv; extras: gpu, diarize, llm)   ├── run.sh   ├── .env.example
    57	├── docker/{Dockerfile.gpu, Dockerfile.cpu, docker-compose.yml}
    58	└── src/luke_scribe/
    59	    ├── config.py        # pydantic-settings (model rt/batch, device, precision, redis, retention, api_keys+scopes, tunnel, corrector+allowlist)
    60	    ├── cli.py           # typer: serve | transcribe | bench | detect
    61	    ├── api/{app.py, deps.py, schemas.py, routes/{jobs.py, stream.py, admin.py}}
    62	    ├── devices/{manager.py, profile.py, vram_probe.py}
    63	    ├── engine/{base.py, faster_whisper_engine.py, model_registry.py}
    64	    ├── audio/{ingest.py, vad.py}
    65	    ├── pipeline/{batch.py, realtime.py}
    66	    ├── jobqueue/{broker.py, jobs.py, worker.py, inproc.py, cancel.py}
    67	    ├── postprocess/{pipeline.py, glossary.py, rules.py, llm.py, confidence.py}
    68	    ├── diarization/pyannote_diarizer.py
    69	    ├── results/{store.py, formats.py, retention.py}
    70	    ├── connectivity/tunnel.py
    71	    └── observability/{logging.py, metrics.py}
    72	```
    73
    74	### 3.5 Worker Execution Model & GPU Concurrency  *(P0-1 해소)*
    75	- **배치 레인:** RQ **`SimpleWorker`**(또는 장수명 커스텀 워커). 워커 부팅 시 `WhisperModel`을 **1회 적재**해 프로세스 수명 동안 보유(재적재·fork 금지). **GPU당 워커 프로세스 1개 기본**, 워커 내 GPU 접근은 단일 스레드(동시 decode 금지). 동시성 = **device-bound 워커 프로세스 수**(인터프로세스), `--workers` 오버라이드.
    76	- **실시간 레인:** WS 세션은 enqueue-once가 아니므로 RQ에 넣지 않고 **API 프로세스 내 장수명 turbo 핸들러**(asyncio + 단일 GPU 락)가 처리. 세션→Job 매핑은 상태 추적용으로만.
    77	- **하트비트/heartbeat 공백 보완:** SimpleWorker는 작업 중 하트비트가 없으므로 §3.7d의 throttled progress emit가 사실상 하트비트 역할(장시간 작업이 "멈춤"으로 오인되지 않게).
    78	- **컨틴전시:** GPU/RQ 조합이 막히면 D1-C(in-proc 워커풀)로 폴백, 동일 Job 인터페이스 유지.
    79
    80	### 3.6 VRAM Sizing & Model-Fit  *(P0-2 해소, 단계 역전 수정)*
    81	- **부팅 시 실측(`devices/vram_probe.py`):** 대상 모델/compute_type를 1회 로드 → `allocated` VRAM 측정 → 보수 헤드룸 ×1.3 저장. **정적 상수에 의존하지 않음.**
    82	- **보수 기본 상수(측정 전 폴백):** large-v3 fp16 ≈ **10GB**, large-v3 int8 ≈ **3.5GB**, turbo fp16 ≈ **4GB**, turbo int8 ≈ **1.8GB** (스펙의 6GB는 폐기).
    83	- **워커수 공식:** `workers = max(1, floor((free_VRAM − reserve) / measured_per_worker))`. 측정값 사용.
    84	- **Model-Fit 분기:** `measured_per_worker(min compute_type) > free_VRAM` → (1) 정밀도 강등(fp16→int8) → (2) **모델 강등**(large-v3→turbo) → (3) CPU → 그래도 불가면 **명확 오류**. **GTX 1050(2~4GB)은 large-v3 GPU 적재 불가**를 명시(배치=turbo-int8 또는 CPU).
    85	- **정밀도 기본:** cc≥7.0 & free≥12GB → `float16`; cc≥7.0 & free<12GB(T4 등) → **`int8_float16`**(워커 더 수용); Pascal(6.x) → `int8`; CPU → `int8`.
    86	- **OOM 처리(무한루프 방지):** OOM 시 강등 시도 **최대 2회**(fp16→int8→CPU) 후 작업 실패 처리(재큐는 1회 한). 강등은 해당 워커 프로세스 전역.
    87
    88	### 3.7 Job Lifecycle: Cancellation, Temp Files, Retention  *(P0-3 해소)*
    89	- **(a) 협조적 취소:** `DELETE /v1/jobs/{id}` → Redis에 cancel 플래그. 워커는 faster-whisper **세그먼트 제너레이터를 소비하며 세그먼트 경계마다 플래그 확인**(유일한 선점 지점). 현재 세그먼트 연산은 완료 후 중단 → 상태 `cancelled`. (긴급 hard-kill은 워커 프로세스 종료 옵션으로만.)
    90	- **(b) 임시파일 수명:** ffmpeg 파생 wav는 추적 tempdir에 생성, **모든 종료 경로(success/fail/cancel/OOM)의 `finally`에서 삭제**. 업로드 원본도 전사 시작 시점 이후 보유하다 종료 시 삭제.
    91	- **(c) 실시간 LocalAgreement 계약:** 설정 명시 — `redecode_window`(예: 마지막 15s 오디오), `confirmed_prefix` 절단 규칙, `retained_left_context`(예: 5s), VAD 무음 경계에서 확정. 확정 세그먼트 방출 후 버퍼 절단(메모리 평탄). 단위 테스트로 버퍼 절단 불변식 검증.
    92	- **(d) progress 발행:** 워커가 제너레이터를 소비하며 `processed_sec/total_sec` 계산 → **throttled `job.save_meta()`**(N 세그먼트마다 또는 ≥1s). `total_sec`는 ingest 시 duration probe로 확보. `queue_position` = 레인 큐 인덱스.
    93	- **(e) 보관 sweeper 경합:** `results/retention.py`는 **터미널 상태(completed/failed/cancelled) Job만** 7d TTL 청소. {queued, processing} 보유 결과·임시물은 건드리지 않음.
    94
    95	### 3.8 Security Boundary  *(P1-7 해소)*
    96	- **API Key + 스코프:** `X-API-Key` 검증 + `ApiKey.scopes` 강제(예: `transcribe`, `admin`). 키 회전/폐기 설정.
    97	- **외부 egress 통제:** LLM `external`/`openai` 백엔드는 **config allowlist 엔드포인트에만** 송신 가능(SSRF 방지), **기본 off + 명시 opt-in + 전송 1건당 감사 로그(key id, endpoint, job id)**. 옵션으로 전송 전 PII 마스킹.
    98
    99	### 3.9 Shared-GPU Accounting & Realtime Concurrency  *(v2.1 — 합의 잔여조건)*
   100	- **(a) 이벤트 루프 비블로킹:** 실시간 turbo decode는 동기 CTranslate2 호출이므로 **`await loop.run_in_executor(single_thread_executor, decode)`** 로 오프로딩(단일 GPU 락 직렬성 유지). P3 착수 시 CT2의 GIL 해제 여부 검증 — 미해제면 실시간 레인을 **별도 디코드 프로세스**로 분리. *(NEW-1a)*
   101	- **(b) 공유 GPU VRAM 회계:** `reserve = base_headroom + (realtime_enabled ? measured_realtime_vram : 0)`. 실시간 모델이 동일 GPU 상주 시 배치 워커수 공식(§3.6)이 그 footprint를 반드시 포함 → 단일 GPU + 실시간 동시 활성에서 oversubscribe 금지. *(NEW-1b)*
   102	- **(c) 실시간 동시 세션:** turbo 인스턴스 1개를 전 세션이 **직렬 공유**(단일 GPU 락). **최대 동시 WS 세션 상한**(설정) + 초과 시 거부/대기, 실시간 레인 대기시간 메트릭. AC-8(≤5s)은 "≤N 세션 한도 내" 보장으로 명시. *(R3)*
   103	- **(d) 실효 compute_type 로깅:** `vram_probe`/`/v1/system`이 **요청 vs 실효 compute_type** 보고, 불일치(T4 `int8_float16` 무음 강등 등) 경고. AC-2/3 계약에 포함. *(NEW-2/R2)*
   104	- **(e) RQ job_timeout:** enqueue 시 `job_timeout ≥ 4h(+마진)`(duration probe 기반). RQ 기본 180s로는 장시간 작업이 3분에 강제 종료되어 AC-7 위반 → 반드시 상향. *(R1/NEW-3)*
   105	- **(f) Phase Exit 구속력:** 각 Phase Exit는 **hard gate**(미충족 시 다음 Phase 착수 금지). 단 옵션 기능(diarization/LLM/tunnel) 미완은 "문서화된 제한"으로 soft 허용. *(R4)*
   106
   107	---
   108
   109	## 4. Implementation Steps (by phase, with file refs & exit criteria)
   110
   111	### P1 — Core + 측정 게이트
   112	1. **스캐폴딩** — `pyproject.toml`(extras), `config.py`, `run.sh`.
   113	2. **Device Manager + VRAM probe** — `devices/{manager.py, vram_probe.py, profile.py}`: 감지 + §3.6 실측·정밀도·워커수·Model-Fit 분기. AC-2/3.
   114	3. **Engine + registry** — `engine/faster_whisper_engine.py`(transcribe: hotwords/initial_prompt/word_ts/vad), `model_registry.py`(rt=turbo/batch=large-v3, 오버라이드). AC-4.
   115	4. **Audio ingest(스트리밍)** — `audio/ingest.py`: **ffmpeg를 파일로 파이프**(전체 배열 인메모리 금지), duration/size probe, 4h/2GB→`413`. `audio/vad.py`: Silero VAD. AC-7/9.
   116	5. **CLI `detect`/`transcribe`/`bench`** — `bench`를 **P1로 전진**: 도메인 KO+EN 클립으로 turbo vs large-v3 **R-WER + entity 보존율 + 속도 + 실측 VRAM** 측정 → 하이브리드 게이트 판정. AC-4/12.
   117	- **Exit:** CPU와 실 GPU 1종에서 단일 파일 전사 성공; `detect`가 measured VRAM·정밀도·워커수 출력; `bench`가 모델 델타 리포트 산출.
   118
   119	### P2 — API + Queue
   120	6. **FastAPI + 인증** — `api/{app.py, deps.py, schemas.py}`: API Key+스코프(§3.8). AC-11.
   121	7. **RQ 워커(SimpleWorker)** — `jobqueue/{broker.py, jobs.py, worker.py, cancel.py, inproc.py}`: §3.5 실행모델, §3.7d progress, §3.7a 취소, §3.6 OOM 강등. AC-5/6.
   122	8. **Jobs 라우트** — `api/routes/jobs.py`: `POST /v1/jobs`(만재→`429`), `GET`(queue_position/progress), `result?format=`, `DELETE`(취소), `GET /v1/jobs`. AC-1/6.
   123	9. **Results + retention** — `results/{store.py, formats.py, retention.py}`: §3.7b/e, 원본·파생 삭제, 7d TTL 터미널만. AC-11.
   124	10. **Docker** — `Dockerfile.{gpu,cpu}` + compose(api+redis+worker). **CT2/CUDA/cuDNN 트리플 핀**(예: CUDA12+cuDNN9; 구형/Colab은 CT2 다운그레이드 경로 문서화). `detect`가 런타임 CUDA 버전 노출. AC-12.
   125	11. **Admin** — `/health`, `/v1/system`(profile/워커/큐깊이/가용 VRAM), `/v1/models`.
   126	- **Exit:** 다건 enqueue→progress→result; **강제 OOM 시 강등·실패 경로 시연**; 취소가 `cancelled`로 종료 + 임시파일 삭제 확인.
   127
   128	### P3 — Realtime
   129	12. **실시간** — `pipeline/realtime.py`(§3.7c 계약: redecode_window/prefix/left-context/VAD), `api/routes/stream.py`(WS, partial/final/status, 백프레셔). AC-8.
   130	- **Exit:** 부분결과 ≤5s + 최종 안정화; **30분 세션 메모리 평탄**(±15% 이내, 단조증가 없음).
   131
   132	### P4 — Output + Post-processing
   133	13. **출력 옵션** — timestamps/word/SRT/VTT 요청별. AC-9.
   134	14. **후처리(glossary/rules/flag)** — `postprocess/{pipeline.py, glossary.py, rules.py, confidence.py}`. AC-10.
   135	- **Exit:** glossary on/off diff로 entity 보존 향상 측정; 저신뢰 플래그 부착.
   136
   137	### P5 — Advanced
   138	15. **LLM 보정(옵션)** — `postprocess/llm.py`: local/external 백엔드, §3.8 egress 통제, confidence-gated, 기본 off. AC-10.
   139	16. **Diarization(옵션)** — `diarization/pyannote_diarizer.py`(HF 토큰).
   140	17. **Colab 터널** — `connectivity/tunnel.py`: **API lifespan에 종속 supervise**(같이 start/stop), URL 회전 시 재출력, 임시성 명시. AC-13.
   141	18. **관측/벤치 확장** — `observability/{logging,metrics}.py`(큐깊이·워커가동·RTF·OOM 카운트), `bench` 확장.
   142	- **Exit:** external egress allowlist+감사로그 동작; Colab `serve --tunnel cloudflare` 외부 200.
   143
   144	---
   145
   146	## 5. Acceptance Criteria (수치화)
   147
   148	스펙 AC-1~13 상속 + **모호 항목 절대 기준화**:
   149	- **AC-2/3:** `detect`가 measured VRAM 기반 정밀도/워커수 산정; 1050→int8/CPU(large-v3 GPU 불가 명시), T4→int8_float16, A100/H100→fp16; `floor` 공식이 측정 상수 사용.
   150	- **AC-4(혼용어, 절대):** 도메인 entity 용어(vLLM·API·FastAPI·Kubernetes·LLM·GPU 등) **verbatim 보존율 ≥ 95%**(hotwords on, domain set), 그리고 도메인 **R-WER ≤ {P1 bench 기준선}**. 보조로 batch(v3) ≤ realtime(turbo).
   151	- **AC-5/6:** 다건 동시 → 각 `GET`이 `queue_position`(앞 N건)·`progress %`(processed_sec/total_sec); 만재 `429`.
   152	- **AC-7(메모리, 절대):** 2h 배치 + 30분 WS에서 **워밍업 후 peak RSS 변동 ±15% 이내, 단조 증가 없음**; OOM 없음; 4h/2GB→`413`.
   153	- **AC-8:** 부분결과 ≤5s(동시 ≤N 세션 한도 내), 최종 안정화; 다중 세션 부하에서도 REST/배치 응답성 유지(이벤트 루프 비블로킹, §3.9a).
   154	- **AC-11:** 전사 후 원본+파생 wav 부재; 결과 7d 후 만료; 터미널 Job만 청소.
   155
   156	---
   157
   158	## 6. Risks and Mitigations
   159	| Risk | Mitigation |
   160	|------|-----------|
   161	| CUDA fork 실패 | SimpleWorker/장수명·model-load-once, fork 금지(§3.5) |
   162	| OOM(동시성) | **측정 VRAM**(§3.6) 기반 워커수, semaphore, 강등 최대 2회 후 실패 |
   163	| turbo 혼용어 부족 | P1 bench 게이트(§4-5), 하이브리드/hotwords/후처리, 모델 스왑 |
   164	| 실시간 떨림/누수 | LocalAgreement 계약(§3.7c), 버퍼 절단 불변식 테스트 |
   165	| 취소 no-op / 임시파일 누출 | 협조적 취소 + `finally` 삭제(§3.7a/b) |
   166	| 보관 경합 | 터미널 상태만 sweeper(§3.7e) |
   167	| 외부 LLM PII 유출 | allowlist+opt-in+감사로그+마스킹(§3.8) |
   168	| CT2/cuDNN 버전 불일치 | Dockerfile 트리플 핀 + 다운그레이드 경로(P2-10) |
   169	| Redis SPOF | in-proc 폴백(D1-C), 헬스체크 |
   170
   171	## 7. Pre-mortem (deliberate — 3 시나리오, 위험표와 비중복)
   172	1. **"P2 통합에서 워커가 fork-CUDA로 즉시 죽었다."** 원인: RQ 기본 fork. 예방: §3.5 SimpleWorker 강제 + P2 Exit에 "실 GPU 워커 기동" 게이트, P1에서 워커 기동 스파이크 선검증.
   173	2. **"데모에서 'vLLM'→'브이엘엘엠'으로 신뢰 상실."** 원인: 모델/hotwords 미검증. 예방: **P1 bench 게이트**(entity 보존율 ≥95% 기준), 기본 hotwords 사전 동봉, 미달 시 v3 채택 자동 판정.
   174	3. **"30분 WS 세션에서 메모리 누수로 컨테이너 OOM-kill."** 원인: 링버퍼/가설 미절단. 예방: §3.7c 절단 계약 + 버퍼 불변식 단위테스트 + P3 Exit의 ±15% 메모리 게이트.
   175
   176	## 8. Expanded Test Plan (deliberate, 수치 게이트 포함)
   177	- **Unit:** Device 결정(cc/measured-VRAM→compute_type/workers/Model-Fit 분기), OOM 강등 캡(≤2), LocalAgreement 버퍼 절단 불변식, formats(srt/vtt 타임코드), rules 정규화, retention(터미널만), options(413/429).
   178	- **Integration:** ffmpeg(스트리밍, 인메모리 아님)→engine→result, Redis enqueue→SimpleWorker→progress(save_meta throttle)→status, 취소 플래그→`cancelled`+임시파일 삭제, glossary diff(entity 보존율), egress allowlist 차단/허용.
   179	- **E2E:** 파일 Job 수명주기(생성→progress→결과·원본+파생 삭제), 동시 다건 queue_position, WS 세션(부분≤5s), Colab `serve --tunnel cloudflare` 200, 강제 OOM 강등 경로.
   180	- **Observability:** 메트릭(큐깊이·워커가동률·RTF·OOM 카운트·**실시간 레인 대기시간**), 구조적 로그(job_id 상관), `/health`·`/v1/system` 계약(**요청 vs 실효 compute_type 보고 검증**), **2h 배치/30분 WS RSS ±15% 평탄 검증**, external egress 감사로그 1건/전송, **다중 세션 AC-8(≤5s) + 동시 `GET /v1/jobs` 응답성 바운드**, **>180s 작업이 job_timeout으로 강제종료되지 않음**, **공유 GPU(배치+실시간) VRAM 비-oversubscribe Model-Fit 상호작용 테스트**.
   181
   182	## 9. Verification Steps
   183	1. `uv sync && uv run python -m luke_scribe.cli detect` → measured VRAM·정밀도·워커수.
   184	2. `... bench --samples samples/ko_en/` → turbo vs v3 R-WER·entity 보존율·VRAM → 하이브리드 판정.
   185	3. `... transcribe samples/ko_en.wav` → entity verbatim ≥95% 확인.
   186	4. `docker compose up` → 다건 `POST`/`GET`(progress/위치), 결과 후 원본+파생 부재, `DELETE`→`cancelled`.
   187	5. 강제 작은-VRAM 환경 → 강등(≤2)·실패 경로 시연.
   188	6. WS 클라이언트 30분 → 부분≤5s, RSS ±15%.
   189	7. `pytest tests/`(unit/integration) + e2e 스모크.
   190	8. Colab → `serve --tunnel cloudflare` URL 외부 200; egress allowlist 차단 테스트.
   191
   192	## 10. ADR (refined)
   193	- **Decision:** faster-whisper 단일 엔진 + (게이트된)하이브리드 모델 + Redis(RQ **SimpleWorker**, no-fork, model-load-once) 배치 + 전용 WS 실시간 핸들러 + **측정 기반 DeviceProfile** + CLI/Docker(버전 핀).
   194	- **Drivers:** 혼용어 정확도, 이식성/자동스케일(실측), 동시성/가시성.
   195	- **Alternatives:** Celery(과함), in-proc(폴백/컨틴전시), WhisperLive/Kit(참고/P5), 멀티엔진(조기), RQ 기본 fork(CUDA 불가→거부).
   196	- **Why chosen:** 사용자 확정(하이브리드/Redis/후처리/보관/상한/diarize) 준수 + fork/VRAM/취소 리스크를 명시 메커니즘으로 해소.
   197	- **Consequences:** SimpleWorker 하트비트 공백→progress로 보완; 실시간 안정화 자체 구현; 1050 배치 large-v3 GPU 불가; Redis 의존; 공유 GPU에서 배치+실시간 VRAM 합산 회계(§3.9b)·실시간 decode 오프로딩(§3.9a) 필요.
   198	- **Follow-ups:** Redis HA, SimulStreaming(P5), egress 마스킹 고도화. (도메인 bench는 follow-up이 아니라 **P1 게이트**로 승격.)
   199
   200	---
   201
   202	## 11. Open Questions (사용자 확인 권장)
   203	1. 실제 배포가 **다중 GPU 워커**를 필요로 하나, 아니면 단일 T4/Colab 위주? (후자면 워커수 공식 위험 축소.)
   204	2. turbo의 KO entity 보존율이 P1 bench에서 ≥95%면 **단일 모델로 단순화**할 의향이 있는지(VRAM/복잡도 절감).
   205	3. 취소는 **협조적(세그먼트 경계)** 으로 충분한지, 즉시 hard-kill이 필요한지.
   206
   207	---
   208
   209	## 12. v2 Changelog (적용 내역 → 리뷰 매핑)
   210	- **[P0-1]** §3.5 Worker Execution Model 신설 — SimpleWorker/no-fork/model-load-once, 배치·실시간 레인 분리, in-proc 컨틴전시. (Arch P0-1, Critic F1)
   211	- **[P0-2]** §3.6 VRAM 부팅 실측 + 보수 상수(large-v3 fp16 10GB) + Model-Fit 분기 + `bench`를 **P1로 전진**(단계 역전 수정). (Arch P0-2/P1-4, Critic F2)
   212	- **[P0-3]** §3.7 협조적 취소 + 임시파일 `finally` 삭제 + 보관 sweeper 터미널-한정. (Arch P0-3/P1-6, Critic F3)
   213	- **[P1]** §3.7d progress 메커니즘(save_meta throttle), §3.8 보안 egress allowlist+감사+스코프, 하이브리드 P1 게이트. (Arch P1-4/5/7, Critic F4/F5)
   214	- **[P1-신규]** §4 단계별 **Exit Criteria** + in-proc 컨틴전시 명시. (Critic F6)
   215	- **[P2]** §3.7c LocalAgreement 계약, §4-4 ffmpeg 스트리밍 보장, §4-10/17 cloudflared lifecycle + CT2/CUDA/cuDNN 핀. (Arch P2-8/9/10)
   216	- **[AC]** AC-4 절대 기준(entity 보존율 ≥95% + R-WER 기준선), AC-7 메모리 ±15% 평탄, 워커당_추정=측정값, OOM 강등 ≤2회, `429`/`413` 분리. (Critic 모호성 4건)
   217	- **[Risk]** OOM 위험을 측정 상수 기반으로 재작성(순환성 제거); pre-mortem #1을 위험표와 비중복화하고 예방을 P1/P2 게이트에 배선.
   218
   219	---
   220
   221	## 13. v2.1 Changelog (합의 잔여조건 반영)
   222	- **[NEW-1a]** §3.9a 실시간 decode `run_in_executor` 오프로딩(+GIL 검증/프로세스 분리 폴백).
   223	- **[NEW-1b]** §3.9b `reserve`에 실시간 모델 VRAM 포함 → 공유 GPU oversubscribe 방지.
   224	- **[R1/NEW-3]** §3.9e RQ `job_timeout ≥ 4h`.
   225	- **[NEW-2/R2]** §3.9d 요청 vs 실효 compute_type 로깅 + AC-2/3 계약 + 관측 테스트.
   226	- **[R3]** §3.9c 실시간 최대 동시 세션 상한 + 대기시간 메트릭 + 다중세션 AC-8 테스트.
   227	- **[R4]** §3.9f Phase Exit 구속력(hard/soft) 명시.
   228
   229	## 합의 결과 (Consensus Outcome)
   230	- **Architect (v2): APPROVE WITH CONDITIONS** — P0×3 해소 확인, 잔여 NEW-1(a/b)/2/3.
   231	- **Critic (v2): APPROVE WITH CONDITIONS** — CRITICAL×3·MAJOR×3 해소 확인, NEW-1 검증 + R1~R4.
   232	- 모든 잔여 조건을 **v2.1**에 반영 → **합의 도달**. iteration 2/5.
   233
   234	---
   235	*Consensus reached (v2.1) — `pending approval`. 실행(team/ralph/autopilot)은 사용자의 별도 명시 승인이 있어야만 진행됩니다. 이 계획 모듈은 승인 전 소스 코드 수정·커밋·실행 스킬 호출을 하지 않습니다.*

codex
I’ve got both documents loaded. Next pass is narrow: I’m checking for under-specified areas around multi-model GPU residency, dependency/runtime pinning, security controls, and whether the test gates actually prove the risky parts.
exec
/bin/bash -lc "sed -n '1,260p' .omc/plans/consensus-luke-scribe-stt-api.md" in /root/luke_scribe
 succeeded in 0ms:
# Consensus Implementation Plan: luke_scribe — 로컬 STT 전사 API

- **Status:** `pending approval` (consensus **v2.1** — iteration 2 합의 도달; Architect+Critic 모두 APPROVE WITH CONDITIONS, 잔여조건 반영 완료)
- **Mode:** `--consensus --direct --deliberate`
- **Source spec:** `.omc/specs/deep-interview-luke-scribe-stt-api.md` (ambiguity ~10%, PASSED)
- **Project:** greenfield `/root/luke_scribe`
- **Generated:** 2026-06-02 · **Revised:** 2026-06-02 (v2)

> **v2 changelog는 문서 맨 끝 §13 참조.** v1 대비 P0 3건(워커 실행모델, VRAM 측정/단계역전, 취소·임시파일·보관경합), P1 4건, P2 3건, 모호 AC 4건을 반영.

---

## 1. Requirements Summary

내부용(비공개) 로컬 STT 전사 API. 단일 `Job` 추상화로 **배치(파일/영상)** 와 **실시간(WebSocket)** 입력 처리. faster-whisper(CTranslate2) 런타임, **하이브리드 모델**(실시간=turbo, 배치=large-v3) — 단 **P1 bench 게이트로 검증**. **Device Manager** 가 GPU/CPU를 감지하고 **부팅 시 VRAM 실측**으로 정밀도·워커수를 산정(1050~H100), `auto|cpu|cuda` 강제 가능. **Redis(RQ) 영속 큐**(배치) + **전용 실시간 핸들러**(WS), `queue_position`·`progress` 보고. 요청별 출력옵션(txt/ts/word/diarize/SRT/VTT). 후처리(glossary→rules→LLM(local/external, 기본 off·신뢰도 게이팅)→confidence). API Key 인증(+스코프), 결과 7일 보관·**파생 오디오 포함 즉시 삭제**, 4h/2GB → `413`, 큐 만재 → `429`. 배포: CLI(dev/Colab)+Docker(prod)+Colab cloudflared. diarization 옵션(pyannote).

---

## 2. RALPLAN-DR Summary

### Principles
1. **Hardware-adaptive, fail-explicit** — 1050~H100 자동 감지·정밀도/동시성 산정. 적재 불가 시 **모델/정밀도 강등 → CPU**로 우아하게 내려가되, 강등이 불가능하면 **명확한 오류로 거부**(조용한 OOM·무한 강등 금지). "never fail"이 아니라 "never fail silently".
2. **One Job abstraction, two execution lanes** — 모든 입력을 Job 수명주기(queued→processing→completed/failed/cancelled)로 통일하되, **배치=RQ 워커 / 실시간=장수명 WS 핸들러**로 실행 레인을 분리(WS는 enqueue-once가 아니므로).
3. **Accuracy/latency 분리(검증 기반)** — batch=large-v3, realtime=turbo. 단 **하이브리드 채택은 P1 bench의 측정 델타로 게이트**(불충분하면 단일 모델로 단순화).
4. **Privacy-first, enforced** — 원본+파생 오디오 즉시 삭제(모든 종료 경로 `finally`), 결과 7d TTL, 외부 LLM egress는 **allowlist+opt-in+감사로그** 없이는 금지.
5. **Dev/prod parity** — 동일 코어, CLI(dev/Colab)/Docker(prod)는 설정 차이. 큐는 prod=RQ / dev=in-proc 폴백이되 **동일 Job 인터페이스 뒤**에 두어 의미 동등성 유지.

### Decision Drivers (top 3)
1. **혼용어(KO+EN) 정확도** — hotwords + 모델 선택 + 후처리.
2. **하드웨어 이식성 + 자동 스케일(실측 기반)**.
3. **동시성 + 가시성(queue_position/progress)**.

### Viable Options

**D1. 큐/동시성 백엔드**
- **(A) Redis + RQ `SimpleWorker`(no-fork) + 장수명 모델 보유 프로세스** ✅ *(채택, 사용자 Redis 확정 + fork 문제 회피)* — Pros: 영속·재시작 내성, 모델 1회 적재 후 재사용, CUDA-fork 충돌 회피. Cons: SimpleWorker는 작업 중 하트비트 없음 → progress emit로 보완 필요.
- **(B) Redis + Celery** — Pros: 라우팅/우선순위/재시도 성숙. Cons: GPU 단일박스엔 과함. *Invalidation:* RQ+SimpleWorker로 충분.
- **(C) in-process asyncio + GPU 세마포어** — dev/단발 폴백 + **P2 컨틴전시**(RQ/CUDA가 막히면 동일 Job 인터페이스로 폴백).
> ⚠️ **근거:** RQ 기본 워커는 작업당 `os.fork()`하며, 부모에서 초기화된 CUDA 컨텍스트는 fork된 자식에서 재사용 불가([pytorch#40403](https://github.com/pytorch/pytorch/issues/40403)). 따라서 **fork 금지(SimpleWorker/장수명)** 가 필수. RQ progress는 내장이 없어 `job.save_meta()` 수동 호출 필요([RQ docs](https://python-rq.org/docs/jobs/)).

**D2. 실시간 스트리밍 구현**
- **(A) faster-whisper 위 커스텀 LocalAgreement-2** ✅ *(채택)* — Pros: 큐/디바이스/후처리 통합 제어. Cons: 안정화 정확성 직접 구현(난이도 과소평가 금지 → §3.7c 계약 명시). 지연 3~5초 관대는 *지연*만 완화하지 *정확성*은 아님.
- **(B) WhisperLive 백엔드 vendoring** — 검증된 WS+VAD를 backend로 감싸기. *Invalidation:* 통합·하이브리드 제어가 우선이나 **청킹/안정화 휴리스틱은 차용**.
- **(C) WhisperLiveKit(AlignAtt/SimulStreaming)** — 2025 SOTA. *Invalidation:* 3~5초 목표엔 과투자(P5 옵션).

**D3. 추론 백엔드 추상화**
- **(A) faster-whisper 단일 엔진 + compute_type 자동** ✅ — Pros: GPU/CPU/int8/fp16 단일 경로. Cons: 타 엔진 미지원(범위상 불필요).
- **(B) 멀티 엔진 플러그인** — *Invalidation:* 조기 추상화 → 얇은 인터페이스(`engine/base.py`)만 두고 구현 1종.

---

## 3. Target Project Structure

```
luke_scribe/
├── pyproject.toml (uv; extras: gpu, diarize, llm)   ├── run.sh   ├── .env.example
├── docker/{Dockerfile.gpu, Dockerfile.cpu, docker-compose.yml}
└── src/luke_scribe/
    ├── config.py        # pydantic-settings (model rt/batch, device, precision, redis, retention, api_keys+scopes, tunnel, corrector+allowlist)
    ├── cli.py           # typer: serve | transcribe | bench | detect
    ├── api/{app.py, deps.py, schemas.py, routes/{jobs.py, stream.py, admin.py}}
    ├── devices/{manager.py, profile.py, vram_probe.py}
    ├── engine/{base.py, faster_whisper_engine.py, model_registry.py}
    ├── audio/{ingest.py, vad.py}
    ├── pipeline/{batch.py, realtime.py}
    ├── jobqueue/{broker.py, jobs.py, worker.py, inproc.py, cancel.py}
    ├── postprocess/{pipeline.py, glossary.py, rules.py, llm.py, confidence.py}
    ├── diarization/pyannote_diarizer.py
    ├── results/{store.py, formats.py, retention.py}
    ├── connectivity/tunnel.py
    └── observability/{logging.py, metrics.py}
```

### 3.5 Worker Execution Model & GPU Concurrency  *(P0-1 해소)*
- **배치 레인:** RQ **`SimpleWorker`**(또는 장수명 커스텀 워커). 워커 부팅 시 `WhisperModel`을 **1회 적재**해 프로세스 수명 동안 보유(재적재·fork 금지). **GPU당 워커 프로세스 1개 기본**, 워커 내 GPU 접근은 단일 스레드(동시 decode 금지). 동시성 = **device-bound 워커 프로세스 수**(인터프로세스), `--workers` 오버라이드.
- **실시간 레인:** WS 세션은 enqueue-once가 아니므로 RQ에 넣지 않고 **API 프로세스 내 장수명 turbo 핸들러**(asyncio + 단일 GPU 락)가 처리. 세션→Job 매핑은 상태 추적용으로만.
- **하트비트/heartbeat 공백 보완:** SimpleWorker는 작업 중 하트비트가 없으므로 §3.7d의 throttled progress emit가 사실상 하트비트 역할(장시간 작업이 "멈춤"으로 오인되지 않게).
- **컨틴전시:** GPU/RQ 조합이 막히면 D1-C(in-proc 워커풀)로 폴백, 동일 Job 인터페이스 유지.

### 3.6 VRAM Sizing & Model-Fit  *(P0-2 해소, 단계 역전 수정)*
- **부팅 시 실측(`devices/vram_probe.py`):** 대상 모델/compute_type를 1회 로드 → `allocated` VRAM 측정 → 보수 헤드룸 ×1.3 저장. **정적 상수에 의존하지 않음.**
- **보수 기본 상수(측정 전 폴백):** large-v3 fp16 ≈ **10GB**, large-v3 int8 ≈ **3.5GB**, turbo fp16 ≈ **4GB**, turbo int8 ≈ **1.8GB** (스펙의 6GB는 폐기).
- **워커수 공식:** `workers = max(1, floor((free_VRAM − reserve) / measured_per_worker))`. 측정값 사용.
- **Model-Fit 분기:** `measured_per_worker(min compute_type) > free_VRAM` → (1) 정밀도 강등(fp16→int8) → (2) **모델 강등**(large-v3→turbo) → (3) CPU → 그래도 불가면 **명확 오류**. **GTX 1050(2~4GB)은 large-v3 GPU 적재 불가**를 명시(배치=turbo-int8 또는 CPU).
- **정밀도 기본:** cc≥7.0 & free≥12GB → `float16`; cc≥7.0 & free<12GB(T4 등) → **`int8_float16`**(워커 더 수용); Pascal(6.x) → `int8`; CPU → `int8`.
- **OOM 처리(무한루프 방지):** OOM 시 강등 시도 **최대 2회**(fp16→int8→CPU) 후 작업 실패 처리(재큐는 1회 한). 강등은 해당 워커 프로세스 전역.

### 3.7 Job Lifecycle: Cancellation, Temp Files, Retention  *(P0-3 해소)*
- **(a) 협조적 취소:** `DELETE /v1/jobs/{id}` → Redis에 cancel 플래그. 워커는 faster-whisper **세그먼트 제너레이터를 소비하며 세그먼트 경계마다 플래그 확인**(유일한 선점 지점). 현재 세그먼트 연산은 완료 후 중단 → 상태 `cancelled`. (긴급 hard-kill은 워커 프로세스 종료 옵션으로만.)
- **(b) 임시파일 수명:** ffmpeg 파생 wav는 추적 tempdir에 생성, **모든 종료 경로(success/fail/cancel/OOM)의 `finally`에서 삭제**. 업로드 원본도 전사 시작 시점 이후 보유하다 종료 시 삭제.
- **(c) 실시간 LocalAgreement 계약:** 설정 명시 — `redecode_window`(예: 마지막 15s 오디오), `confirmed_prefix` 절단 규칙, `retained_left_context`(예: 5s), VAD 무음 경계에서 확정. 확정 세그먼트 방출 후 버퍼 절단(메모리 평탄). 단위 테스트로 버퍼 절단 불변식 검증.
- **(d) progress 발행:** 워커가 제너레이터를 소비하며 `processed_sec/total_sec` 계산 → **throttled `job.save_meta()`**(N 세그먼트마다 또는 ≥1s). `total_sec`는 ingest 시 duration probe로 확보. `queue_position` = 레인 큐 인덱스.
- **(e) 보관 sweeper 경합:** `results/retention.py`는 **터미널 상태(completed/failed/cancelled) Job만** 7d TTL 청소. {queued, processing} 보유 결과·임시물은 건드리지 않음.

### 3.8 Security Boundary  *(P1-7 해소)*
- **API Key + 스코프:** `X-API-Key` 검증 + `ApiKey.scopes` 강제(예: `transcribe`, `admin`). 키 회전/폐기 설정.
- **외부 egress 통제:** LLM `external`/`openai` 백엔드는 **config allowlist 엔드포인트에만** 송신 가능(SSRF 방지), **기본 off + 명시 opt-in + 전송 1건당 감사 로그(key id, endpoint, job id)**. 옵션으로 전송 전 PII 마스킹.

### 3.9 Shared-GPU Accounting & Realtime Concurrency  *(v2.1 — 합의 잔여조건)*
- **(a) 이벤트 루프 비블로킹:** 실시간 turbo decode는 동기 CTranslate2 호출이므로 **`await loop.run_in_executor(single_thread_executor, decode)`** 로 오프로딩(단일 GPU 락 직렬성 유지). P3 착수 시 CT2의 GIL 해제 여부 검증 — 미해제면 실시간 레인을 **별도 디코드 프로세스**로 분리. *(NEW-1a)*
- **(b) 공유 GPU VRAM 회계:** `reserve = base_headroom + (realtime_enabled ? measured_realtime_vram : 0)`. 실시간 모델이 동일 GPU 상주 시 배치 워커수 공식(§3.6)이 그 footprint를 반드시 포함 → 단일 GPU + 실시간 동시 활성에서 oversubscribe 금지. *(NEW-1b)*
- **(c) 실시간 동시 세션:** turbo 인스턴스 1개를 전 세션이 **직렬 공유**(단일 GPU 락). **최대 동시 WS 세션 상한**(설정) + 초과 시 거부/대기, 실시간 레인 대기시간 메트릭. AC-8(≤5s)은 "≤N 세션 한도 내" 보장으로 명시. *(R3)*
- **(d) 실효 compute_type 로깅:** `vram_probe`/`/v1/system`이 **요청 vs 실효 compute_type** 보고, 불일치(T4 `int8_float16` 무음 강등 등) 경고. AC-2/3 계약에 포함. *(NEW-2/R2)*
- **(e) RQ job_timeout:** enqueue 시 `job_timeout ≥ 4h(+마진)`(duration probe 기반). RQ 기본 180s로는 장시간 작업이 3분에 강제 종료되어 AC-7 위반 → 반드시 상향. *(R1/NEW-3)*
- **(f) Phase Exit 구속력:** 각 Phase Exit는 **hard gate**(미충족 시 다음 Phase 착수 금지). 단 옵션 기능(diarization/LLM/tunnel) 미완은 "문서화된 제한"으로 soft 허용. *(R4)*

---

## 4. Implementation Steps (by phase, with file refs & exit criteria)

### P1 — Core + 측정 게이트
1. **스캐폴딩** — `pyproject.toml`(extras), `config.py`, `run.sh`.
2. **Device Manager + VRAM probe** — `devices/{manager.py, vram_probe.py, profile.py}`: 감지 + §3.6 실측·정밀도·워커수·Model-Fit 분기. AC-2/3.
3. **Engine + registry** — `engine/faster_whisper_engine.py`(transcribe: hotwords/initial_prompt/word_ts/vad), `model_registry.py`(rt=turbo/batch=large-v3, 오버라이드). AC-4.
4. **Audio ingest(스트리밍)** — `audio/ingest.py`: **ffmpeg를 파일로 파이프**(전체 배열 인메모리 금지), duration/size probe, 4h/2GB→`413`. `audio/vad.py`: Silero VAD. AC-7/9.
5. **CLI `detect`/`transcribe`/`bench`** — `bench`를 **P1로 전진**: 도메인 KO+EN 클립으로 turbo vs large-v3 **R-WER + entity 보존율 + 속도 + 실측 VRAM** 측정 → 하이브리드 게이트 판정. AC-4/12.
- **Exit:** CPU와 실 GPU 1종에서 단일 파일 전사 성공; `detect`가 measured VRAM·정밀도·워커수 출력; `bench`가 모델 델타 리포트 산출.

### P2 — API + Queue
6. **FastAPI + 인증** — `api/{app.py, deps.py, schemas.py}`: API Key+스코프(§3.8). AC-11.
7. **RQ 워커(SimpleWorker)** — `jobqueue/{broker.py, jobs.py, worker.py, cancel.py, inproc.py}`: §3.5 실행모델, §3.7d progress, §3.7a 취소, §3.6 OOM 강등. AC-5/6.
8. **Jobs 라우트** — `api/routes/jobs.py`: `POST /v1/jobs`(만재→`429`), `GET`(queue_position/progress), `result?format=`, `DELETE`(취소), `GET /v1/jobs`. AC-1/6.
9. **Results + retention** — `results/{store.py, formats.py, retention.py}`: §3.7b/e, 원본·파생 삭제, 7d TTL 터미널만. AC-11.
10. **Docker** — `Dockerfile.{gpu,cpu}` + compose(api+redis+worker). **CT2/CUDA/cuDNN 트리플 핀**(예: CUDA12+cuDNN9; 구형/Colab은 CT2 다운그레이드 경로 문서화). `detect`가 런타임 CUDA 버전 노출. AC-12.
11. **Admin** — `/health`, `/v1/system`(profile/워커/큐깊이/가용 VRAM), `/v1/models`.
- **Exit:** 다건 enqueue→progress→result; **강제 OOM 시 강등·실패 경로 시연**; 취소가 `cancelled`로 종료 + 임시파일 삭제 확인.

### P3 — Realtime
12. **실시간** — `pipeline/realtime.py`(§3.7c 계약: redecode_window/prefix/left-context/VAD), `api/routes/stream.py`(WS, partial/final/status, 백프레셔). AC-8.
- **Exit:** 부분결과 ≤5s + 최종 안정화; **30분 세션 메모리 평탄**(±15% 이내, 단조증가 없음).

### P4 — Output + Post-processing
13. **출력 옵션** — timestamps/word/SRT/VTT 요청별. AC-9.
14. **후처리(glossary/rules/flag)** — `postprocess/{pipeline.py, glossary.py, rules.py, confidence.py}`. AC-10.
- **Exit:** glossary on/off diff로 entity 보존 향상 측정; 저신뢰 플래그 부착.

### P5 — Advanced
15. **LLM 보정(옵션)** — `postprocess/llm.py`: local/external 백엔드, §3.8 egress 통제, confidence-gated, 기본 off. AC-10.
16. **Diarization(옵션)** — `diarization/pyannote_diarizer.py`(HF 토큰).
17. **Colab 터널** — `connectivity/tunnel.py`: **API lifespan에 종속 supervise**(같이 start/stop), URL 회전 시 재출력, 임시성 명시. AC-13.
18. **관측/벤치 확장** — `observability/{logging,metrics}.py`(큐깊이·워커가동·RTF·OOM 카운트), `bench` 확장.
- **Exit:** external egress allowlist+감사로그 동작; Colab `serve --tunnel cloudflare` 외부 200.

---

## 5. Acceptance Criteria (수치화)

스펙 AC-1~13 상속 + **모호 항목 절대 기준화**:
- **AC-2/3:** `detect`가 measured VRAM 기반 정밀도/워커수 산정; 1050→int8/CPU(large-v3 GPU 불가 명시), T4→int8_float16, A100/H100→fp16; `floor` 공식이 측정 상수 사용.
- **AC-4(혼용어, 절대):** 도메인 entity 용어(vLLM·API·FastAPI·Kubernetes·LLM·GPU 등) **verbatim 보존율 ≥ 95%**(hotwords on, domain set), 그리고 도메인 **R-WER ≤ {P1 bench 기준선}**. 보조로 batch(v3) ≤ realtime(turbo).
- **AC-5/6:** 다건 동시 → 각 `GET`이 `queue_position`(앞 N건)·`progress %`(processed_sec/total_sec); 만재 `429`.
- **AC-7(메모리, 절대):** 2h 배치 + 30분 WS에서 **워밍업 후 peak RSS 변동 ±15% 이내, 단조 증가 없음**; OOM 없음; 4h/2GB→`413`.
- **AC-8:** 부분결과 ≤5s(동시 ≤N 세션 한도 내), 최종 안정화; 다중 세션 부하에서도 REST/배치 응답성 유지(이벤트 루프 비블로킹, §3.9a).
- **AC-11:** 전사 후 원본+파생 wav 부재; 결과 7d 후 만료; 터미널 Job만 청소.

---

## 6. Risks and Mitigations
| Risk | Mitigation |
|------|-----------|
| CUDA fork 실패 | SimpleWorker/장수명·model-load-once, fork 금지(§3.5) |
| OOM(동시성) | **측정 VRAM**(§3.6) 기반 워커수, semaphore, 강등 최대 2회 후 실패 |
| turbo 혼용어 부족 | P1 bench 게이트(§4-5), 하이브리드/hotwords/후처리, 모델 스왑 |
| 실시간 떨림/누수 | LocalAgreement 계약(§3.7c), 버퍼 절단 불변식 테스트 |
| 취소 no-op / 임시파일 누출 | 협조적 취소 + `finally` 삭제(§3.7a/b) |
| 보관 경합 | 터미널 상태만 sweeper(§3.7e) |
| 외부 LLM PII 유출 | allowlist+opt-in+감사로그+마스킹(§3.8) |
| CT2/cuDNN 버전 불일치 | Dockerfile 트리플 핀 + 다운그레이드 경로(P2-10) |
| Redis SPOF | in-proc 폴백(D1-C), 헬스체크 |

## 7. Pre-mortem (deliberate — 3 시나리오, 위험표와 비중복)
1. **"P2 통합에서 워커가 fork-CUDA로 즉시 죽었다."** 원인: RQ 기본 fork. 예방: §3.5 SimpleWorker 강제 + P2 Exit에 "실 GPU 워커 기동" 게이트, P1에서 워커 기동 스파이크 선검증.
2. **"데모에서 'vLLM'→'브이엘엘엠'으로 신뢰 상실."** 원인: 모델/hotwords 미검증. 예방: **P1 bench 게이트**(entity 보존율 ≥95% 기준), 기본 hotwords 사전 동봉, 미달 시 v3 채택 자동 판정.
3. **"30분 WS 세션에서 메모리 누수로 컨테이너 OOM-kill."** 원인: 링버퍼/가설 미절단. 예방: §3.7c 절단 계약 + 버퍼 불변식 단위테스트 + P3 Exit의 ±15% 메모리 게이트.

## 8. Expanded Test Plan (deliberate, 수치 게이트 포함)
- **Unit:** Device 결정(cc/measured-VRAM→compute_type/workers/Model-Fit 분기), OOM 강등 캡(≤2), LocalAgreement 버퍼 절단 불변식, formats(srt/vtt 타임코드), rules 정규화, retention(터미널만), options(413/429).
- **Integration:** ffmpeg(스트리밍, 인메모리 아님)→engine→result, Redis enqueue→SimpleWorker→progress(save_meta throttle)→status, 취소 플래그→`cancelled`+임시파일 삭제, glossary diff(entity 보존율), egress allowlist 차단/허용.
- **E2E:** 파일 Job 수명주기(생성→progress→결과·원본+파생 삭제), 동시 다건 queue_position, WS 세션(부분≤5s), Colab `serve --tunnel cloudflare` 200, 강제 OOM 강등 경로.
- **Observability:** 메트릭(큐깊이·워커가동률·RTF·OOM 카운트·**실시간 레인 대기시간**), 구조적 로그(job_id 상관), `/health`·`/v1/system` 계약(**요청 vs 실효 compute_type 보고 검증**), **2h 배치/30분 WS RSS ±15% 평탄 검증**, external egress 감사로그 1건/전송, **다중 세션 AC-8(≤5s) + 동시 `GET /v1/jobs` 응답성 바운드**, **>180s 작업이 job_timeout으로 강제종료되지 않음**, **공유 GPU(배치+실시간) VRAM 비-oversubscribe Model-Fit 상호작용 테스트**.

## 9. Verification Steps
1. `uv sync && uv run python -m luke_scribe.cli detect` → measured VRAM·정밀도·워커수.
2. `... bench --samples samples/ko_en/` → turbo vs v3 R-WER·entity 보존율·VRAM → 하이브리드 판정.
3. `... transcribe samples/ko_en.wav` → entity verbatim ≥95% 확인.
4. `docker compose up` → 다건 `POST`/`GET`(progress/위치), 결과 후 원본+파생 부재, `DELETE`→`cancelled`.
5. 강제 작은-VRAM 환경 → 강등(≤2)·실패 경로 시연.
6. WS 클라이언트 30분 → 부분≤5s, RSS ±15%.
7. `pytest tests/`(unit/integration) + e2e 스모크.
8. Colab → `serve --tunnel cloudflare` URL 외부 200; egress allowlist 차단 테스트.

## 10. ADR (refined)
- **Decision:** faster-whisper 단일 엔진 + (게이트된)하이브리드 모델 + Redis(RQ **SimpleWorker**, no-fork, model-load-once) 배치 + 전용 WS 실시간 핸들러 + **측정 기반 DeviceProfile** + CLI/Docker(버전 핀).
- **Drivers:** 혼용어 정확도, 이식성/자동스케일(실측), 동시성/가시성.
- **Alternatives:** Celery(과함), in-proc(폴백/컨틴전시), WhisperLive/Kit(참고/P5), 멀티엔진(조기), RQ 기본 fork(CUDA 불가→거부).
- **Why chosen:** 사용자 확정(하이브리드/Redis/후처리/보관/상한/diarize) 준수 + fork/VRAM/취소 리스크를 명시 메커니즘으로 해소.
- **Consequences:** SimpleWorker 하트비트 공백→progress로 보완; 실시간 안정화 자체 구현; 1050 배치 large-v3 GPU 불가; Redis 의존; 공유 GPU에서 배치+실시간 VRAM 합산 회계(§3.9b)·실시간 decode 오프로딩(§3.9a) 필요.
- **Follow-ups:** Redis HA, SimulStreaming(P5), egress 마스킹 고도화. (도메인 bench는 follow-up이 아니라 **P1 게이트**로 승격.)

---

## 11. Open Questions (사용자 확인 권장)
1. 실제 배포가 **다중 GPU 워커**를 필요로 하나, 아니면 단일 T4/Colab 위주? (후자면 워커수 공식 위험 축소.)
2. turbo의 KO entity 보존율이 P1 bench에서 ≥95%면 **단일 모델로 단순화**할 의향이 있는지(VRAM/복잡도 절감).
3. 취소는 **협조적(세그먼트 경계)** 으로 충분한지, 즉시 hard-kill이 필요한지.

---

## 12. v2 Changelog (적용 내역 → 리뷰 매핑)
- **[P0-1]** §3.5 Worker Execution Model 신설 — SimpleWorker/no-fork/model-load-once, 배치·실시간 레인 분리, in-proc 컨틴전시. (Arch P0-1, Critic F1)
- **[P0-2]** §3.6 VRAM 부팅 실측 + 보수 상수(large-v3 fp16 10GB) + Model-Fit 분기 + `bench`를 **P1로 전진**(단계 역전 수정). (Arch P0-2/P1-4, Critic F2)
- **[P0-3]** §3.7 협조적 취소 + 임시파일 `finally` 삭제 + 보관 sweeper 터미널-한정. (Arch P0-3/P1-6, Critic F3)
- **[P1]** §3.7d progress 메커니즘(save_meta throttle), §3.8 보안 egress allowlist+감사+스코프, 하이브리드 P1 게이트. (Arch P1-4/5/7, Critic F4/F5)
- **[P1-신규]** §4 단계별 **Exit Criteria** + in-proc 컨틴전시 명시. (Critic F6)
- **[P2]** §3.7c LocalAgreement 계약, §4-4 ffmpeg 스트리밍 보장, §4-10/17 cloudflared lifecycle + CT2/CUDA/cuDNN 핀. (Arch P2-8/9/10)
- **[AC]** AC-4 절대 기준(entity 보존율 ≥95% + R-WER 기준선), AC-7 메모리 ±15% 평탄, 워커당_추정=측정값, OOM 강등 ≤2회, `429`/`413` 분리. (Critic 모호성 4건)
- **[Risk]** OOM 위험을 측정 상수 기반으로 재작성(순환성 제거); pre-mortem #1을 위험표와 비중복화하고 예방을 P1/P2 게이트에 배선.

---

## 13. v2.1 Changelog (합의 잔여조건 반영)
- **[NEW-1a]** §3.9a 실시간 decode `run_in_executor` 오프로딩(+GIL 검증/프로세스 분리 폴백).
- **[NEW-1b]** §3.9b `reserve`에 실시간 모델 VRAM 포함 → 공유 GPU oversubscribe 방지.
- **[R1/NEW-3]** §3.9e RQ `job_timeout ≥ 4h`.
- **[NEW-2/R2]** §3.9d 요청 vs 실효 compute_type 로깅 + AC-2/3 계약 + 관측 테스트.
- **[R3]** §3.9c 실시간 최대 동시 세션 상한 + 대기시간 메트릭 + 다중세션 AC-8 테스트.
- **[R4]** §3.9f Phase Exit 구속력(hard/soft) 명시.

## 합의 결과 (Consensus Outcome)
- **Architect (v2): APPROVE WITH CONDITIONS** — P0×3 해소 확인, 잔여 NEW-1(a/b)/2/3.
- **Critic (v2): APPROVE WITH CONDITIONS** — CRITICAL×3·MAJOR×3 해소 확인, NEW-1 검증 + R1~R4.
- 모든 잔여 조건을 **v2.1**에 반영 → **합의 도달**. iteration 2/5.

---
*Consensus reached (v2.1) — `pending approval`. 실행(team/ralph/autopilot)은 사용자의 별도 명시 승인이 있어야만 진행됩니다. 이 계획 모듈은 승인 전 소스 코드 수정·커밋·실행 스킬 호출을 하지 않습니다.*

exec
/bin/bash -lc 'rg -n "download|cache|cold|timestamp|word_timestamps|diar|pyannote|cloudflare|API key|api_keys|rotation|allowlist|SSRF|PII|retention|sweeper|executor|batch|batched|Batched|beam|VAD|Pascal|CUDA|cuDNN|CTranslate2|job_timeout|SimpleWorker|heartbeat|health|queue_position|save_meta|model" .omc/specs/deep-interview-luke-scribe-stt-api.md .omc/plans/consensus-luke-scribe-stt-api.md' in /root/luke_scribe
 succeeded in 0ms:
.omc/plans/consensus-luke-scribe-stt-api.md:15:내부용(비공개) 로컬 STT 전사 API. 단일 `Job` 추상화로 **배치(파일/영상)** 와 **실시간(WebSocket)** 입력 처리. faster-whisper(CTranslate2) 런타임, **하이브리드 모델**(실시간=turbo, 배치=large-v3) — 단 **P1 bench 게이트로 검증**. **Device Manager** 가 GPU/CPU를 감지하고 **부팅 시 VRAM 실측**으로 정밀도·워커수를 산정(1050~H100), `auto|cpu|cuda` 강제 가능. **Redis(RQ) 영속 큐**(배치) + **전용 실시간 핸들러**(WS), `queue_position`·`progress` 보고. 요청별 출력옵션(txt/ts/word/diarize/SRT/VTT). 후처리(glossary→rules→LLM(local/external, 기본 off·신뢰도 게이팅)→confidence). API Key 인증(+스코프), 결과 7일 보관·**파생 오디오 포함 즉시 삭제**, 4h/2GB → `413`, 큐 만재 → `429`. 배포: CLI(dev/Colab)+Docker(prod)+Colab cloudflared. diarization 옵션(pyannote).
.omc/plans/consensus-luke-scribe-stt-api.md:24:3. **Accuracy/latency 분리(검증 기반)** — batch=large-v3, realtime=turbo. 단 **하이브리드 채택은 P1 bench의 측정 델타로 게이트**(불충분하면 단일 모델로 단순화).
.omc/plans/consensus-luke-scribe-stt-api.md:25:4. **Privacy-first, enforced** — 원본+파생 오디오 즉시 삭제(모든 종료 경로 `finally`), 결과 7d TTL, 외부 LLM egress는 **allowlist+opt-in+감사로그** 없이는 금지.
.omc/plans/consensus-luke-scribe-stt-api.md:31:3. **동시성 + 가시성(queue_position/progress)**.
.omc/plans/consensus-luke-scribe-stt-api.md:36:- **(A) Redis + RQ `SimpleWorker`(no-fork) + 장수명 모델 보유 프로세스** ✅ *(채택, 사용자 Redis 확정 + fork 문제 회피)* — Pros: 영속·재시작 내성, 모델 1회 적재 후 재사용, CUDA-fork 충돌 회피. Cons: SimpleWorker는 작업 중 하트비트 없음 → progress emit로 보완 필요.
.omc/plans/consensus-luke-scribe-stt-api.md:37:- **(B) Redis + Celery** — Pros: 라우팅/우선순위/재시도 성숙. Cons: GPU 단일박스엔 과함. *Invalidation:* RQ+SimpleWorker로 충분.
.omc/plans/consensus-luke-scribe-stt-api.md:38:- **(C) in-process asyncio + GPU 세마포어** — dev/단발 폴백 + **P2 컨틴전시**(RQ/CUDA가 막히면 동일 Job 인터페이스로 폴백).
.omc/plans/consensus-luke-scribe-stt-api.md:39:> ⚠️ **근거:** RQ 기본 워커는 작업당 `os.fork()`하며, 부모에서 초기화된 CUDA 컨텍스트는 fork된 자식에서 재사용 불가([pytorch#40403](https://github.com/pytorch/pytorch/issues/40403)). 따라서 **fork 금지(SimpleWorker/장수명)** 가 필수. RQ progress는 내장이 없어 `job.save_meta()` 수동 호출 필요([RQ docs](https://python-rq.org/docs/jobs/)).
.omc/plans/consensus-luke-scribe-stt-api.md:43:- **(B) WhisperLive 백엔드 vendoring** — 검증된 WS+VAD를 backend로 감싸기. *Invalidation:* 통합·하이브리드 제어가 우선이나 **청킹/안정화 휴리스틱은 차용**.
.omc/plans/consensus-luke-scribe-stt-api.md:56:├── pyproject.toml (uv; extras: gpu, diarize, llm)   ├── run.sh   ├── .env.example
.omc/plans/consensus-luke-scribe-stt-api.md:59:    ├── config.py        # pydantic-settings (model rt/batch, device, precision, redis, retention, api_keys+scopes, tunnel, corrector+allowlist)
.omc/plans/consensus-luke-scribe-stt-api.md:63:    ├── engine/{base.py, faster_whisper_engine.py, model_registry.py}
.omc/plans/consensus-luke-scribe-stt-api.md:65:    ├── pipeline/{batch.py, realtime.py}
.omc/plans/consensus-luke-scribe-stt-api.md:68:    ├── diarization/pyannote_diarizer.py
.omc/plans/consensus-luke-scribe-stt-api.md:69:    ├── results/{store.py, formats.py, retention.py}
.omc/plans/consensus-luke-scribe-stt-api.md:75:- **배치 레인:** RQ **`SimpleWorker`**(또는 장수명 커스텀 워커). 워커 부팅 시 `WhisperModel`을 **1회 적재**해 프로세스 수명 동안 보유(재적재·fork 금지). **GPU당 워커 프로세스 1개 기본**, 워커 내 GPU 접근은 단일 스레드(동시 decode 금지). 동시성 = **device-bound 워커 프로세스 수**(인터프로세스), `--workers` 오버라이드.
.omc/plans/consensus-luke-scribe-stt-api.md:77:- **하트비트/heartbeat 공백 보완:** SimpleWorker는 작업 중 하트비트가 없으므로 §3.7d의 throttled progress emit가 사실상 하트비트 역할(장시간 작업이 "멈춤"으로 오인되지 않게).
.omc/plans/consensus-luke-scribe-stt-api.md:85:- **정밀도 기본:** cc≥7.0 & free≥12GB → `float16`; cc≥7.0 & free<12GB(T4 등) → **`int8_float16`**(워커 더 수용); Pascal(6.x) → `int8`; CPU → `int8`.
.omc/plans/consensus-luke-scribe-stt-api.md:91:- **(c) 실시간 LocalAgreement 계약:** 설정 명시 — `redecode_window`(예: 마지막 15s 오디오), `confirmed_prefix` 절단 규칙, `retained_left_context`(예: 5s), VAD 무음 경계에서 확정. 확정 세그먼트 방출 후 버퍼 절단(메모리 평탄). 단위 테스트로 버퍼 절단 불변식 검증.
.omc/plans/consensus-luke-scribe-stt-api.md:92:- **(d) progress 발행:** 워커가 제너레이터를 소비하며 `processed_sec/total_sec` 계산 → **throttled `job.save_meta()`**(N 세그먼트마다 또는 ≥1s). `total_sec`는 ingest 시 duration probe로 확보. `queue_position` = 레인 큐 인덱스.
.omc/plans/consensus-luke-scribe-stt-api.md:93:- **(e) 보관 sweeper 경합:** `results/retention.py`는 **터미널 상태(completed/failed/cancelled) Job만** 7d TTL 청소. {queued, processing} 보유 결과·임시물은 건드리지 않음.
.omc/plans/consensus-luke-scribe-stt-api.md:97:- **외부 egress 통제:** LLM `external`/`openai` 백엔드는 **config allowlist 엔드포인트에만** 송신 가능(SSRF 방지), **기본 off + 명시 opt-in + 전송 1건당 감사 로그(key id, endpoint, job id)**. 옵션으로 전송 전 PII 마스킹.
.omc/plans/consensus-luke-scribe-stt-api.md:100:- **(a) 이벤트 루프 비블로킹:** 실시간 turbo decode는 동기 CTranslate2 호출이므로 **`await loop.run_in_executor(single_thread_executor, decode)`** 로 오프로딩(단일 GPU 락 직렬성 유지). P3 착수 시 CT2의 GIL 해제 여부 검증 — 미해제면 실시간 레인을 **별도 디코드 프로세스**로 분리. *(NEW-1a)*
.omc/plans/consensus-luke-scribe-stt-api.md:104:- **(e) RQ job_timeout:** enqueue 시 `job_timeout ≥ 4h(+마진)`(duration probe 기반). RQ 기본 180s로는 장시간 작업이 3분에 강제 종료되어 AC-7 위반 → 반드시 상향. *(R1/NEW-3)*
.omc/plans/consensus-luke-scribe-stt-api.md:105:- **(f) Phase Exit 구속력:** 각 Phase Exit는 **hard gate**(미충족 시 다음 Phase 착수 금지). 단 옵션 기능(diarization/LLM/tunnel) 미완은 "문서화된 제한"으로 soft 허용. *(R4)*
.omc/plans/consensus-luke-scribe-stt-api.md:114:3. **Engine + registry** — `engine/faster_whisper_engine.py`(transcribe: hotwords/initial_prompt/word_ts/vad), `model_registry.py`(rt=turbo/batch=large-v3, 오버라이드). AC-4.
.omc/plans/consensus-luke-scribe-stt-api.md:115:4. **Audio ingest(스트리밍)** — `audio/ingest.py`: **ffmpeg를 파일로 파이프**(전체 배열 인메모리 금지), duration/size probe, 4h/2GB→`413`. `audio/vad.py`: Silero VAD. AC-7/9.
.omc/plans/consensus-luke-scribe-stt-api.md:121:7. **RQ 워커(SimpleWorker)** — `jobqueue/{broker.py, jobs.py, worker.py, cancel.py, inproc.py}`: §3.5 실행모델, §3.7d progress, §3.7a 취소, §3.6 OOM 강등. AC-5/6.
.omc/plans/consensus-luke-scribe-stt-api.md:122:8. **Jobs 라우트** — `api/routes/jobs.py`: `POST /v1/jobs`(만재→`429`), `GET`(queue_position/progress), `result?format=`, `DELETE`(취소), `GET /v1/jobs`. AC-1/6.
.omc/plans/consensus-luke-scribe-stt-api.md:123:9. **Results + retention** — `results/{store.py, formats.py, retention.py}`: §3.7b/e, 원본·파생 삭제, 7d TTL 터미널만. AC-11.
.omc/plans/consensus-luke-scribe-stt-api.md:124:10. **Docker** — `Dockerfile.{gpu,cpu}` + compose(api+redis+worker). **CT2/CUDA/cuDNN 트리플 핀**(예: CUDA12+cuDNN9; 구형/Colab은 CT2 다운그레이드 경로 문서화). `detect`가 런타임 CUDA 버전 노출. AC-12.
.omc/plans/consensus-luke-scribe-stt-api.md:125:11. **Admin** — `/health`, `/v1/system`(profile/워커/큐깊이/가용 VRAM), `/v1/models`.
.omc/plans/consensus-luke-scribe-stt-api.md:129:12. **실시간** — `pipeline/realtime.py`(§3.7c 계약: redecode_window/prefix/left-context/VAD), `api/routes/stream.py`(WS, partial/final/status, 백프레셔). AC-8.
.omc/plans/consensus-luke-scribe-stt-api.md:133:13. **출력 옵션** — timestamps/word/SRT/VTT 요청별. AC-9.
.omc/plans/consensus-luke-scribe-stt-api.md:139:16. **Diarization(옵션)** — `diarization/pyannote_diarizer.py`(HF 토큰).
.omc/plans/consensus-luke-scribe-stt-api.md:142:- **Exit:** external egress allowlist+감사로그 동작; Colab `serve --tunnel cloudflare` 외부 200.
.omc/plans/consensus-luke-scribe-stt-api.md:150:- **AC-4(혼용어, 절대):** 도메인 entity 용어(vLLM·API·FastAPI·Kubernetes·LLM·GPU 등) **verbatim 보존율 ≥ 95%**(hotwords on, domain set), 그리고 도메인 **R-WER ≤ {P1 bench 기준선}**. 보조로 batch(v3) ≤ realtime(turbo).
.omc/plans/consensus-luke-scribe-stt-api.md:151:- **AC-5/6:** 다건 동시 → 각 `GET`이 `queue_position`(앞 N건)·`progress %`(processed_sec/total_sec); 만재 `429`.
.omc/plans/consensus-luke-scribe-stt-api.md:161:| CUDA fork 실패 | SimpleWorker/장수명·model-load-once, fork 금지(§3.5) |
.omc/plans/consensus-luke-scribe-stt-api.md:166:| 보관 경합 | 터미널 상태만 sweeper(§3.7e) |
.omc/plans/consensus-luke-scribe-stt-api.md:167:| 외부 LLM PII 유출 | allowlist+opt-in+감사로그+마스킹(§3.8) |
.omc/plans/consensus-luke-scribe-stt-api.md:168:| CT2/cuDNN 버전 불일치 | Dockerfile 트리플 핀 + 다운그레이드 경로(P2-10) |
.omc/plans/consensus-luke-scribe-stt-api.md:172:1. **"P2 통합에서 워커가 fork-CUDA로 즉시 죽었다."** 원인: RQ 기본 fork. 예방: §3.5 SimpleWorker 강제 + P2 Exit에 "실 GPU 워커 기동" 게이트, P1에서 워커 기동 스파이크 선검증.
.omc/plans/consensus-luke-scribe-stt-api.md:177:- **Unit:** Device 결정(cc/measured-VRAM→compute_type/workers/Model-Fit 분기), OOM 강등 캡(≤2), LocalAgreement 버퍼 절단 불변식, formats(srt/vtt 타임코드), rules 정규화, retention(터미널만), options(413/429).
.omc/plans/consensus-luke-scribe-stt-api.md:178:- **Integration:** ffmpeg(스트리밍, 인메모리 아님)→engine→result, Redis enqueue→SimpleWorker→progress(save_meta throttle)→status, 취소 플래그→`cancelled`+임시파일 삭제, glossary diff(entity 보존율), egress allowlist 차단/허용.
.omc/plans/consensus-luke-scribe-stt-api.md:179:- **E2E:** 파일 Job 수명주기(생성→progress→결과·원본+파생 삭제), 동시 다건 queue_position, WS 세션(부분≤5s), Colab `serve --tunnel cloudflare` 200, 강제 OOM 강등 경로.
.omc/plans/consensus-luke-scribe-stt-api.md:180:- **Observability:** 메트릭(큐깊이·워커가동률·RTF·OOM 카운트·**실시간 레인 대기시간**), 구조적 로그(job_id 상관), `/health`·`/v1/system` 계약(**요청 vs 실효 compute_type 보고 검증**), **2h 배치/30분 WS RSS ±15% 평탄 검증**, external egress 감사로그 1건/전송, **다중 세션 AC-8(≤5s) + 동시 `GET /v1/jobs` 응답성 바운드**, **>180s 작업이 job_timeout으로 강제종료되지 않음**, **공유 GPU(배치+실시간) VRAM 비-oversubscribe Model-Fit 상호작용 테스트**.
.omc/plans/consensus-luke-scribe-stt-api.md:190:8. Colab → `serve --tunnel cloudflare` URL 외부 200; egress allowlist 차단 테스트.
.omc/plans/consensus-luke-scribe-stt-api.md:193:- **Decision:** faster-whisper 단일 엔진 + (게이트된)하이브리드 모델 + Redis(RQ **SimpleWorker**, no-fork, model-load-once) 배치 + 전용 WS 실시간 핸들러 + **측정 기반 DeviceProfile** + CLI/Docker(버전 핀).
.omc/plans/consensus-luke-scribe-stt-api.md:195:- **Alternatives:** Celery(과함), in-proc(폴백/컨틴전시), WhisperLive/Kit(참고/P5), 멀티엔진(조기), RQ 기본 fork(CUDA 불가→거부).
.omc/plans/consensus-luke-scribe-stt-api.md:196:- **Why chosen:** 사용자 확정(하이브리드/Redis/후처리/보관/상한/diarize) 준수 + fork/VRAM/취소 리스크를 명시 메커니즘으로 해소.
.omc/plans/consensus-luke-scribe-stt-api.md:197:- **Consequences:** SimpleWorker 하트비트 공백→progress로 보완; 실시간 안정화 자체 구현; 1050 배치 large-v3 GPU 불가; Redis 의존; 공유 GPU에서 배치+실시간 VRAM 합산 회계(§3.9b)·실시간 decode 오프로딩(§3.9a) 필요.
.omc/plans/consensus-luke-scribe-stt-api.md:210:- **[P0-1]** §3.5 Worker Execution Model 신설 — SimpleWorker/no-fork/model-load-once, 배치·실시간 레인 분리, in-proc 컨틴전시. (Arch P0-1, Critic F1)
.omc/plans/consensus-luke-scribe-stt-api.md:212:- **[P0-3]** §3.7 협조적 취소 + 임시파일 `finally` 삭제 + 보관 sweeper 터미널-한정. (Arch P0-3/P1-6, Critic F3)
.omc/plans/consensus-luke-scribe-stt-api.md:213:- **[P1]** §3.7d progress 메커니즘(save_meta throttle), §3.8 보안 egress allowlist+감사+스코프, 하이브리드 P1 게이트. (Arch P1-4/5/7, Critic F4/F5)
.omc/plans/consensus-luke-scribe-stt-api.md:215:- **[P2]** §3.7c LocalAgreement 계약, §4-4 ffmpeg 스트리밍 보장, §4-10/17 cloudflared lifecycle + CT2/CUDA/cuDNN 핀. (Arch P2-8/9/10)
.omc/plans/consensus-luke-scribe-stt-api.md:222:- **[NEW-1a]** §3.9a 실시간 decode `run_in_executor` 오프로딩(+GIL 검증/프로세스 분리 폴백).
.omc/plans/consensus-luke-scribe-stt-api.md:224:- **[R1/NEW-3]** §3.9e RQ `job_timeout ≥ 4h`.
.omc/specs/deep-interview-luke-scribe-stt-api.md:33:| 3 | **Realtime Pipeline** | active | VAD·청크·부분/최종 결과 스트리밍 | AC-8 |
.omc/specs/deep-interview-luke-scribe-stt-api.md:34:| 4 | **Output / Results** | active | 요청별 출력옵션(txt/ts/word/diarize/SRT/VTT), 결과 보관(7일) | AC-9, AC-11 |
.omc/specs/deep-interview-luke-scribe-stt-api.md:35:| 5 | **Job Queue / Concurrency** (1급) | active | Job 추상화, **Redis 영속 큐**, 워커풀, 우선순위 레인, queue_position·진행률 | AC-5, AC-6 |
.omc/specs/deep-interview-luke-scribe-stt-api.md:38:| 8 | **Connectivity / Tunnel** | active | Colab 등 공인 IP 없는 환경 자동 외부 노출(cloudflared 등) | AC-13 |
.omc/specs/deep-interview-luke-scribe-stt-api.md:51:- **하드웨어 폭이 매우 넓음:** 개발=GTX 1050(Pascal, 2~4GB), 테스트=Colab/T4/L4/A100/H100. → 고정 수치 설정 불가, **자동 산정 필수**.
.omc/specs/deep-interview-luke-scribe-stt-api.md:52:- **정밀도 자동 선택:** compute capability ≥ 7.0 → fp16, Pascal(6.x) → int8, VRAM 부족 → CPU 폴백, CPU → int8.
.omc/specs/deep-interview-luke-scribe-stt-api.md:54:- **모델 하이브리드:** 실시간=turbo, 배치=large-v3 (둘 다 설치, `model` 오버라이드 가능).
.omc/specs/deep-interview-luke-scribe-stt-api.md:81:- [ ] **AC-6** 호출자가 `queue_position`(앞 N건)과 `progress`(처리된 길이/전체, %)를 조회할 수 있다.
.omc/specs/deep-interview-luke-scribe-stt-api.md:82:- [ ] **AC-7** 장시간/대용량 파일이 VAD 세그먼트로 분할되어 진행률을 제공하고 메모리 사용이 일정하다. 4h/2GB 초과는 `413`.
.omc/specs/deep-interview-luke-scribe-stt-api.md:84:- [ ] **AC-9** 영상 파일이 ffmpeg로 오디오 추출 후 전사되고, 출력 옵션(timestamps/word/diarize/formats)이 요청별로 동작한다.
.omc/specs/deep-interview-luke-scribe-stt-api.md:88:- [ ] **AC-13** Colab에서 `--tunnel cloudflare`로 공개 URL이 자동 발급되어 외부에서 호출된다.
.omc/specs/deep-interview-luke-scribe-stt-api.md:103:        │                │  │  - upload    │     │   (realtime/batch)  │  │
.omc/specs/deep-interview-luke-scribe-stt-api.md:104:  WS    ├── 실시간 오디오▶│  │  - WS stream │     │  - queue_position   │  │
.omc/specs/deep-interview-luke-scribe-stt-api.md:113:        │                │  │ VAD→chunk→   │◀───▶│  ┌───────────────┐  │  │
.omc/specs/deep-interview-luke-scribe-stt-api.md:117:        │                │         │             │  │ batch=large-v3│  │  │
.omc/specs/deep-interview-luke-scribe-stt-api.md:130:        │                │  Connectivity/Tunnel (Colab→cloudflared 자동)  │
.omc/specs/deep-interview-luke-scribe-stt-api.md:139:| `POST` | `/v1/jobs` | multipart: `file`(오디오/영상) + `options`(JSON). → `{job_id, status:"queued", queue_position}` |
.omc/specs/deep-interview-luke-scribe-stt-api.md:140:| `GET` | `/v1/jobs/{id}` | 상태 조회: `queued`(queue_position, jobs_ahead) / `processing`(progress %, processed_sec/total_sec, eta) / `completed` / `failed`(error) |
.omc/specs/deep-interview-luke-scribe-stt-api.md:150:**Admin/관측:** `GET /health`, `GET /v1/system`(device 프로파일·워커수·큐 깊이), `GET /v1/models`.
.omc/specs/deep-interview-luke-scribe-stt-api.md:158:  "model": null,                         // null=경로별 기본(rt=turbo, batch=large-v3). 오버라이드 가능
.omc/specs/deep-interview-luke-scribe-stt-api.md:161:  "timestamps": true,                    // 세그먼트 타임스탬프
.omc/specs/deep-interview-luke-scribe-stt-api.md:162:  "word_timestamps": false,              // 단어 단위
.omc/specs/deep-interview-luke-scribe-stt-api.md:163:  "diarize": false,                      // 화자 분리(pyannote, opt, HF 토큰)
.omc/specs/deep-interview-luke-scribe-stt-api.md:171:    "corrector_model": null              // 백엔드별 모델/엔드포인트
.omc/specs/deep-interview-luke-scribe-stt-api.md:178:- **런타임:** **faster-whisper (CTranslate2)** — openai-whisper 대비 ~4배 빠르고 메모리 적음, GPU/CPU·fp16/int8 지원, Silero VAD 내장, 배치 추론 지원.
.omc/specs/deep-interview-luke-scribe-stt-api.md:182:  - 두 모델 모두 설치, 경로별 기본값 적용. `model` 옵션/환경변수로 런타임 오버라이드.
.omc/specs/deep-interview-luke-scribe-stt-api.md:193:- WebSocket 오디오 프레임 → 링버퍼 → **Silero VAD**로 발화 구간 검출 → 청크 구성 → 전사 → 부분/최종 방출.
.omc/specs/deep-interview-luke-scribe-stt-api.md:195:- **참고 구현:** [WhisperLive](https://github.com/collabora/WhisperLive)(faster-whisper 백엔드, WS, VAD), [WhisperLiveKit](https://github.com/QuentinFuxa/WhisperLiveKit)(AlignAtt), [whisper_streaming](https://github.com/ufal/whisper_streaming)(→ SimulStreaming 대체 추세). 정책 채택/재구현 모두 가능.
.omc/specs/deep-interview-luke-scribe-stt-api.md:210:- **진행률:** 장시간 파일은 VAD 세그먼트 분할 → `progress = 완료 세그먼트 / 전체`(또는 처리 오디오초/전체초). `queue_position` = 큐 인덱스.
.omc/specs/deep-interview-luke-scribe-stt-api.md:215:**감지:** CUDA 가용 여부, device name, VRAM total/free, compute capability, CPU 코어/RAM.
.omc/specs/deep-interview-luke-scribe-stt-api.md:221:| **GTX 1050** | 6.1 (Pascal, fp16 비효율) | **int8**, 부족 시 **CPU 폴백** | 1 워커 또는 CPU |
.omc/specs/deep-interview-luke-scribe-stt-api.md:222:| CUDA 없음 | — | **CPU int8** | CPU 코어 기반(1~2) |
.omc/specs/deep-interview-luke-scribe-stt-api.md:224:- **모델 적재 검사:** `model_fp16 > free_VRAM` → int8 시도 → 그래도 부족 → CPU 폴백(또는 작은 모델 + 경고).
.omc/specs/deep-interview-luke-scribe-stt-api.md:226:- **오버라이드:** `--device auto|cpu|cuda:N`, `--compute-type`, `--model`, `--workers`.
.omc/specs/deep-interview-luke-scribe-stt-api.md:234:3. **LLM 보정(확정: 백엔드 설정화, 기본 off·confidence-gated):** 저신뢰/고WER 구간만 교정(Judge-Editor: 고신뢰 스팬 유지, 불확실 스팬만 재작성). **백엔드 플러그형** — `local`(소형 LLM, 오프라인·프라이버시) 또는 `openai`/`external`(OpenAI 호환 엔드포인트), `corrector_model` 설정 가능. 기본 비활성(약 HW 보호·과교정 방지).
.omc/specs/deep-interview-luke-scribe-stt-api.md:243:- **기본: cloudflared Quick Tunnel** — `https://<random>.trycloudflare.com`, **계정/도메인 불필요**, 임시 URL, 제로 설정. (`--tunnel cloudflare`)
.omc/specs/deep-interview-luke-scribe-stt-api.md:255:- **설정:** env + `.env`/yaml — model(rt/batch), device, workers, api_keys, retention_days, tunnel, redis_url, corrector.
.omc/specs/deep-interview-luke-scribe-stt-api.md:259:Python 3.11+, **FastAPI** + uvicorn, **faster-whisper(CTranslate2)** (turbo + large-v3), **ffmpeg**(영상→16kHz mono), **Silero VAD**(faster-whisper 내장), **Redis + RQ/Celery**(영속 큐), pydantic v2, **LLM 보정 백엔드**(local: llama.cpp/transformers · external: OpenAI 호환 client), (옵션) **pyannote.audio**(diarization, HF 토큰), **cloudflared**/pyngrok(터널), loguru/structlog(로깅), prometheus-client(메트릭, 옵션).
.omc/specs/deep-interview-luke-scribe-stt-api.md:267:| "한 번에 한 작업이면 충분" | 동시/대기 다작업·중간 추가 입력은? | 큐를 1급 승격, Redis 영속, 우선순위 레인 + queue_position/progress |
.omc/specs/deep-interview-luke-scribe-stt-api.md:272:| "Colab도 IP로 호출" | Colab은 공인 IP 없음 | cloudflared Quick Tunnel 자동 노출 |
.omc/specs/deep-interview-luke-scribe-stt-api.md:282:| Job | core | id, type(file/stream/video), status, queue_position, progress, options, created_at | has many Segment, produces TranscriptResult |
.omc/specs/deep-interview-luke-scribe-stt-api.md:284:| Engine | core | runtime(faster-whisper), model(turbo/large-v3), compute_type | used by Worker |
.omc/specs/deep-interview-luke-scribe-stt-api.md:287:| Worker | core | id, device, model_instance, busy | consumes Queue(Redis), runs Engine |
.omc/specs/deep-interview-luke-scribe-stt-api.md:288:| Queue | core | lane(realtime/batch), depth, backend(redis) | holds Job |
.omc/specs/deep-interview-luke-scribe-stt-api.md:291:| RequestOptions | supporting | language, model, device, formats, hotwords, post_correction | configures Job |
.omc/specs/deep-interview-luke-scribe-stt-api.md:293:| PostProcessor | supporting | mode, backend(local/external), corrector_model, stages | transforms TranscriptResult |
.omc/specs/deep-interview-luke-scribe-stt-api.md:297:| Tunnel | supporting | provider(cloudflare/ngrok/none), public_url | exposes API |
.omc/specs/deep-interview-luke-scribe-stt-api.md:320:| 6 | 화자 분리 | **옵션 포함**(pyannote, HF 토큰), 기본 off, 요청 시 `diarize=true`. |
.omc/specs/deep-interview-luke-scribe-stt-api.md:330:| **P3 Realtime** | WebSocket 스트리밍, VAD 청크, LocalAgreement 부분/최종(turbo) | 실시간 전사 |
.omc/specs/deep-interview-luke-scribe-stt-api.md:331:| **P4 Output+Post** | timestamps/word/SRT/VTT, glossary/rules 후처리, confidence flags | 풍부한 출력 + 1차 후처리 |
.omc/specs/deep-interview-luke-scribe-stt-api.md:332:| **P5 Advanced** | LLM 후처리 백엔드(local/external, 옵션), diarization(pyannote, 옵션), Colab cloudflared 자동, 메트릭/모니터링, `bench` | 운영·고급 기능 |
.omc/specs/deep-interview-luke-scribe-stt-api.md:336:- **약 GPU(1050) 실시간 한계:** turbo도 Pascal/2GB에선 버거움 → int8/CPU 폴백, 실시간은 사실상 T4+ 권장(문서화).
.omc/specs/deep-interview-luke-scribe-stt-api.md:342:- **Quick Tunnel 임시 URL:** trycloudflare는 비영구 → 안정 필요 시 named tunnel/ngrok.
.omc/specs/deep-interview-luke-scribe-stt-api.md:369:- Colab 등 공인 IP 부재 환경 **자동 외부 노출** → cloudflared Quick Tunnel 기본.

**Reviewing design issues**

I need to consider the existing issues since the user wants feedback on the design docs only. My task is to assess and prioritize new and current problems. I'll identify any poor decisions or risks that weren't handled well and refer to specific sections.

For instance, there's the multi-model residency and boot-time VRAM probe, which might cause issues with loading. Also, I need to take note of the potential delays in downloading models and the need for a robust health check system.
**Assessing model load and downgrade issues**

Plan 3.6 mentions loading target model and compute type once for measurement, but it doesn't clarify if artifacts are cached during image build or at startup; this is crucial, especially in Colab. I need to refer to sections 3.6 and P2-10.

Also, the model downgrade semantics present a potential issue. It seems that downgrading from large-v3 to turbo under VRAM pressure could contradict accuracy expectations. I should highlight the need for a clear fail-fast policy or adjust acceptance criteria to reflect model performance accurately.
**Addressing architectural priorities**

This is about ensuring architectural correctness, as requests on the batch lane could default to turbo due to memory limitations, which goes against the contract. One solution might be to set the batch model preference to accuracy by default and only degrade precision or device as needed, rather than the model itself unless there's a specific override.

Additionally, I need to think about the shared GPU scheduling issue, as there may be unfairness leading to real-time worker starvation when they share the same GPU structure.
**Navigating worker heartbeat nuances**

I'm considering a nuanced point: `save_meta` isn’t the same as a worker heartbeat. I wonder if this affects the architecture. Without a heartbeat, the worker state could seem outdated, leading to expired TTLs and maintenance tasks mistakenly marking the worker as dead. I need to understand the behavior of the RQ SimpleWorker, which doesn’t send heartbeats during jobs. This could impact the job registry expiration since it uses heartbeats to extend TTL. I might look into the documentation.
**Evaluating job specs**

I see that the user-facing GET /v1/jobs/{id} only shows queue position within a lane, which raises questions about how that compares to completion ETA when considering different hardware lanes. The spec only mentions queue_position and jobs ahead; that might suffice. Still, the handling of word timestamps and diarization costs is under-addressed. The acceptance test doesn't require performance checks for these options, despite their potential high latency and resource demands. So, I need to weigh these considerations carefully.
**Assessing resource isolation**

I notice there's a gap in resource isolation and accounting for the pyannote model and word-timestamp alignment in the architecture. For the internal API, there could be a risk if users request diarization on the same GPU, which might overload memory. It's important to establish either a separate queue or denote that it's incompatible with real-time or shared GPUs. The spec mentions including word_timestamps and diarization, but there's no VRAM accounting for that. I should highlight this as a potential issue, although it may not be a top priority unless it leads to out-of-memory (OOM) errors.
**Evaluating inference processes**

I’m considering the priority of faster-whisper batched inference versus sequential processing. The spec indicates support for batched inference, but doesn't require its use. The worker model operates with sequential single-thread decoding, which reduces latency but may result in underutilization on larger GPUs. User inquiries about comparing these approaches suggest noting the absence of an explicit choice. On A100/H100, using single sequential decoding could lead to issues with memory bandwidth. A benchmark gate appears necessary for proper selection between these options, as the current plan hardcodes process-per-worker.
**Identifying throughput issues**

I’m looking into the missing throughput for P1/P2 benches, as cited in §3.5 and §3.6. This seems to be a significant oversight. On CT2 with faster-whisper, it appears that using the `BatchedInferencePipeline` or adjusting the `batch_size` could enhance throughput. It’s important to address this to ensure optimal performance and avoid potential bottlenecks. I’ll need to dig deeper into these specifications and their implications for processing efficiency.
codex
I’m validating a couple of dependency/runtime assumptions against primary docs before I lock the review. The plan’s biggest open risks are around actual CTranslate2 GPU support on Pascal-class cards and whether its worker/concurrency assumptions hold on shared GPUs.
web search:
web search: CTranslate2 GPU requirements compute capability Pascal official docs
web search:
web search: https://opennmt.net/CTranslate2/hardware_support.html?highlight=mkl
**Evaluating source requirements**

The user wants just review text, but I think I should include a brief note about the primary document indicating RQ SimpleWorker isn’t meant for production. I must also provide links if I use web sources, but they specifically requested section numbers instead of links. The system states I need to link to sources used, though, so maybe I can add them parenthetically without excessive quoting. I might also need to check out the CTranslate2 multithreading and parallelism documentation for reference.
web search:
web search: CTranslate2 multithreading parallelism official docs batched inference
web search:
web search: https://opennmt.net/CTranslate2/parallel.html
**Searching for word timestamps**

I'm thinking that I could search for "word_timestamps" in the faster-whisper README. Maybe I could use a "find" function to locate specific lines, perhaps around line 430. I’m wondering if I really need to use "search_query" at all since it seems like just opening the document and finding the information directly could be quicker. I'll look into the pattern for "word_timestamps" on the website directly. Let’s see what I can find!
web search:
web search: 'word_timestamps' in https://github.com/SYSTRAN/faster-whisper
**Evaluating worker count for GPU jobs**

I need to choose the worker count based on free VRAM at boot. However, on shared GPU boxes, free VRAM can fluctuate as other processes allocate memory later, leading to out-of-memory (OOM) issues. It's crucial to dynamically check the worker count before each job since just using the boot-time measurement works for dedicated GPUs. This method poses a risk, particularly when the environment changes. I need to ensure proactive measures, like re-probing or using admission control based on real-time memory availability. Good catch!
**Assessing worker count and admission control**

There’s an issue with the global worker count remaining too high, even with OOM degradation. I really need to implement live admission control since this is an architectural decision and likely one of the top three priorities. It’s clear that while boot-time VRAM sizing is necessary, it isn't enough. I must ensure runtime admission control for every job, particularly with multiple models running. Also, using multiple processes per GPU may not be optimal for memory usage due to weight duplication.
**Evaluating throughput and multi-model residency**

Using H100 allows multiple copies, but I’m concerned that throughput could actually be worse compared to a single process with inter-threads or a batched pipeline. This relates directly to the batched versus sequential issue, especially since the user asked about assessing multi-model residency and batched inference. I need to bring up that `SimpleWorker` isn't recommended for production and lacks periodic heartbeats. The plan I have includes saving metadata as a heartbeat, but I'm wondering if that’s sufficient to elevate its status.
**Evaluating API options**

I noticed that there are mentions of API key scopes, but there’s a lack of details regarding specific per-scope option caps. It’s essential to have either an admin-only override or a policy matrix. The internal API seems to support dynamic model/device overrides, but without a proper server-side policy, clients could exploit it and create costly combinations. So, tying overrides to scope or implementing an allowlist seems like a vital step here for security and architecture.
**Considering tech setup**

I see that the specifications mention using the ffmpeg binary since faster-whisper bundles PyAV. However, my plan intentionally calls for an ffmpeg file pipeline. I might be facing an issue with Cloudflared's Quick Tunnel because it generates a random domain that's not stable and lacks TLS pinning. But, I think this isn't a major concern beyond public exposure, so I should keep that in mind as I move forward.
**Evaluating compute compatibility**

I'm examining the allocation of compute and measuring the load with two models: large-v3 and turbo, both in fp16/int8. The startup process could take several minutes, which feels a bit like a cold start issue. According to the CT2 docs, the GTX 1050 should support CUDA12 with specific drivers. However, it’s crucial to confirm compatibility due to hardware nuances. I need to think about the necessity for an explicit tested matrix for hardware support, considering dependencies like CUDA and cuDNN versions.
**Identifying model management concerns**

I'm noticing potential issues with faster-whisper, especially since it auto-downloads models from Hugging Face Hub. Using `WhisperModel("large-v3")` may pose constraints regarding the "private internal API" and local/offline model weights. There's no plan to package model artifacts for Docker, which is a significant requirement mismatch. I need to suggest an explicit pre-download and health check. Also, I'm curious if enough diagnostics are kept for failed jobs and whether the egress allowlist covers model names or API keys. Lastly, the deployment process with WebSocket handling seems inconsistent.
**Addressing GPU resource concerns**

I've noticed a potentially significant issue with multiple API workers. If each worker creates its own turbo model and locks GPU resources, it could unexpectedly multiply VRAM usage, leading to problems with shared-GPU accounting. This is new and quite critical! The plan mentions that using Docker with uvicorn/gunicorn could mean multiple web worker processes might each load their own real-time turbo handler and model unless I configure it for a single worker. I need to keep that in mind.
**Identifying architectural issues**

I see a potential issue with the Redis queue and results storage using local filesystem or SQLite in separate Docker containers. Since the worker writes results to its own filesystem, the API won't be able to access them unless there's a shared volume or network-accessible database. This could lead to major bugs. The plan mentions that the default storage is local file/SQLite but it needs to extend sharing to S3 or another database. It’s a critical aspect that needs to be addressed urgently!
**Prioritizing architectural changes**

I think we might need to implement a shared persistent volume or a Redis/DB-backed result store right from the start. This is looking like a higher priority than I initially thought. My top priorities should be: ensuring the shared result store exists across API and worker containers, managing real-time model residency with multiple API processes and shared GPU scheduling, and addressing offline model artifacts. The result store issue seems particularly severe due to its unsound design, so I’ll focus on that!
**Evaluating Pascal support**

I think I shouldn't mention GPU support as an issue since external documentation shows it's supported. It’s plausible to say Pascal support itself works, but the tested matrix isn't specified. The user asked for a Pascal assessment, so I could mention that "No blocker on Pascal from CT2 alone," focusing more on the pinning/driver matrix rather than just compute capability, which is nuanced. I should make sure the final answer is concise, around 650 words, using flat bullet points.
**Formulating the review**

I think I should include a short sentence about external sanity checks since the user requested a review text without commentary. It isn't necessary to mention file links since they’re document sections, not code file links, but I could cite file names plainly. The final answer can include references to specific sections. Clickable file links sound nice, but they might not be necessary. Still, I should mention the path to the file, like [deep-interview-luke-scribe-stt-api.md]. Let's keep it straightforward.
**Structuring the final report**

I'm thinking I should craft the final report with around 7 concise bullet points, plus a section for the Top 3 changes and a Verdict. Each bullet point will be a single paragraph, and I want to keep the formatting simple without nested bullets. Maybe I could incorporate severity labels for clarity. I believe I’ll need to include section citations in parentheses and add external links inline to reference tools like RQ and faster-whisper. The user requested section numbers, so I’ll make sure to include those.
codex
**Prioritized Findings**

- **Critical:** The realtime topology is still unsound for production deployment. §3.5/§3.9 assume one in-process turbo handler with one GPU lock, but Deployment still implies `uvicorn/gunicorn` multi-process serving (spec Deployment; plan §3.5, §3.9, P2-10). In practice, each API worker would get its own turbo model and its own lock, multiplying VRAM and defeating the session cap. Separately, `reserve` handles VRAM, not compute-time contention, so a long batch decode can still violate AC-8 on the same GPU. This needs a singleton decode service/process per GPU, or an explicit “API workers=1 on GPU nodes + batch paused/segregated while WS is active” rule.

- **Critical:** The default results store is not deployment-complete. The spec allows local file/SQLite storage (spec §4 Output / Results), while P2 deploys separate `api + redis + worker` containers (plan P2-9/10). Without an explicitly shared volume or a common DB/object store, the worker writes results the API cannot read. Even with SQLite on a shared filesystem, you need an explicit locking and failure model. As written, `GET /v1/jobs/{id}` and `/result` are not reliably correct across containers.

- **High:** The model-fallback contract is still semantically unsafe. §3.6 allows automatic `large-v3 -> turbo -> CPU` downgrade, but the system contract says batch is the accuracy lane and AC-4 assumes batch should outperform realtime (spec Goal/Constraints, AC-4; plan §3.6, §5). Silent model downgrade makes batch semantics probabilistic. Combined with request-level `model/device/compute_type` overrides (spec §1 options) and only coarse scopes in §3.8, clients can force capacity-destabilizing or accuracy-reducing runs. Batch should degrade precision/device, not model, unless the caller explicitly opts in.

- **High:** The offline model-artifact story is missing. The spec requires local/offline weights (spec Constraints), but neither doc defines prefetch/cache/warmup. With faster-whisper, model names resolve to auto-download behavior unless artifacts are already present; boot-time VRAM probing in §3.6 can therefore trigger network download and cold-start latency. I would require explicit artifact packaging or startup prefetch, cache path ownership, and readiness gating before the service accepts traffic. External check: [faster-whisper README](https://github.com/SYSTRAN/faster-whisper).

- **High:** `SimpleWorker` production semantics are still under-handled. §3.7d improves progress, but `job.save_meta()` is not a substitute for worker heartbeats. The plan treats it as one (§3.5, §3.7d), while RQ’s own docs say `SimpleWorker` lacks periodic heartbeats and is not recommended for production. If RQ stays, I would require a custom long-lived worker/liveness model rather than stock `SimpleWorker` assumptions. External check: [RQ workers docs](https://python-rq.org/docs/workers/).

- **Medium:** GPU throughput strategy remains unresolved. §3.5 hardcodes sequential per-process decoding and §3.6 sizes only by VRAM. That ignores CTranslate2 same-device parallelism/shared weights and faster-whisper batched inference, so large GPUs may end up with too many model copies and worse tail latency than a single shared model with batching/inter-threads. This is especially relevant on L4/A100/H100. External checks: [CTranslate2 parallelism](https://opennmt.net/CTranslate2/parallel.html), [faster-whisper batching](https://github.com/SYSTRAN/faster-whisper).

- **Medium:** Optional expensive features are not resource-accounted or acceptance-tested. `word_timestamps` and `diarize` are request-visible in the spec and AC-9, but §3.6 does not budget for their latency/VRAM cost, and P5 adds pyannote as a second model/runtime with no device-isolation policy. VAD is also central to progress and streaming stability, but there is no fixed lane-specific `vad_parameters` acceptance set for noisy/overlapping Korean speech.

- **Medium:** The acceptance plan still leaves key gates floating. AC-4 retains a placeholder R-WER baseline, AC-8 leaves session cap `N` undefined, and there are no explicit tests for key rotation/revocation, model-cache miss/readiness, tunnel abuse controls, or multi-process API topology under shared GPU pressure (§5, §8, AC-4, AC-8, AC-13).

**Top 3 Required Changes Before Coding**

1. Make realtime GPU execution a singleton per GPU and define the production API worker topology explicitly; do not leave it in-process under unconstrained multi-worker FastAPI deployment.
2. Replace “local file/SQLite by default” with a clearly shared durable results store for `api` and `worker`, or mandate a shared volume/DB architecture in P2.
3. Tighten execution policy: no silent batch model downgrade, restrict expensive overrides by scope, and add explicit model artifact prefetch/readiness rules.

**Verdict**

**NEEDS-CHANGES**. The plan is close, but the remaining issues are architectural, not cosmetic: deployment topology, shared-state correctness, and execution-policy semantics are still loose enough to cause real failures during implementation.
tokens used
89,565
**Prioritized Findings**

- **Critical:** The realtime topology is still unsound for production deployment. §3.5/§3.9 assume one in-process turbo handler with one GPU lock, but Deployment still implies `uvicorn/gunicorn` multi-process serving (spec Deployment; plan §3.5, §3.9, P2-10). In practice, each API worker would get its own turbo model and its own lock, multiplying VRAM and defeating the session cap. Separately, `reserve` handles VRAM, not compute-time contention, so a long batch decode can still violate AC-8 on the same GPU. This needs a singleton decode service/process per GPU, or an explicit “API workers=1 on GPU nodes + batch paused/segregated while WS is active” rule.

- **Critical:** The default results store is not deployment-complete. The spec allows local file/SQLite storage (spec §4 Output / Results), while P2 deploys separate `api + redis + worker` containers (plan P2-9/10). Without an explicitly shared volume or a common DB/object store, the worker writes results the API cannot read. Even with SQLite on a shared filesystem, you need an explicit locking and failure model. As written, `GET /v1/jobs/{id}` and `/result` are not reliably correct across containers.

- **High:** The model-fallback contract is still semantically unsafe. §3.6 allows automatic `large-v3 -> turbo -> CPU` downgrade, but the system contract says batch is the accuracy lane and AC-4 assumes batch should outperform realtime (spec Goal/Constraints, AC-4; plan §3.6, §5). Silent model downgrade makes batch semantics probabilistic. Combined with request-level `model/device/compute_type` overrides (spec §1 options) and only coarse scopes in §3.8, clients can force capacity-destabilizing or accuracy-reducing runs. Batch should degrade precision/device, not model, unless the caller explicitly opts in.

- **High:** The offline model-artifact story is missing. The spec requires local/offline weights (spec Constraints), but neither doc defines prefetch/cache/warmup. With faster-whisper, model names resolve to auto-download behavior unless artifacts are already present; boot-time VRAM probing in §3.6 can therefore trigger network download and cold-start latency. I would require explicit artifact packaging or startup prefetch, cache path ownership, and readiness gating before the service accepts traffic. External check: [faster-whisper README](https://github.com/SYSTRAN/faster-whisper).

- **High:** `SimpleWorker` production semantics are still under-handled. §3.7d improves progress, but `job.save_meta()` is not a substitute for worker heartbeats. The plan treats it as one (§3.5, §3.7d), while RQ’s own docs say `SimpleWorker` lacks periodic heartbeats and is not recommended for production. If RQ stays, I would require a custom long-lived worker/liveness model rather than stock `SimpleWorker` assumptions. External check: [RQ workers docs](https://python-rq.org/docs/workers/).

- **Medium:** GPU throughput strategy remains unresolved. §3.5 hardcodes sequential per-process decoding and §3.6 sizes only by VRAM. That ignores CTranslate2 same-device parallelism/shared weights and faster-whisper batched inference, so large GPUs may end up with too many model copies and worse tail latency than a single shared model with batching/inter-threads. This is especially relevant on L4/A100/H100. External checks: [CTranslate2 parallelism](https://opennmt.net/CTranslate2/parallel.html), [faster-whisper batching](https://github.com/SYSTRAN/faster-whisper).

- **Medium:** Optional expensive features are not resource-accounted or acceptance-tested. `word_timestamps` and `diarize` are request-visible in the spec and AC-9, but §3.6 does not budget for their latency/VRAM cost, and P5 adds pyannote as a second model/runtime with no device-isolation policy. VAD is also central to progress and streaming stability, but there is no fixed lane-specific `vad_parameters` acceptance set for noisy/overlapping Korean speech.

- **Medium:** The acceptance plan still leaves key gates floating. AC-4 retains a placeholder R-WER baseline, AC-8 leaves session cap `N` undefined, and there are no explicit tests for key rotation/revocation, model-cache miss/readiness, tunnel abuse controls, or multi-process API topology under shared GPU pressure (§5, §8, AC-4, AC-8, AC-13).

**Top 3 Required Changes Before Coding**

1. Make realtime GPU execution a singleton per GPU and define the production API worker topology explicitly; do not leave it in-process under unconstrained multi-worker FastAPI deployment.
2. Replace “local file/SQLite by default” with a clearly shared durable results store for `api` and `worker`, or mandate a shared volume/DB architecture in P2.
3. Tighten execution policy: no silent batch model downgrade, restrict expensive overrides by scope, and add explicit model artifact prefetch/readiness rules.

**Verdict**

**NEEDS-CHANGES**. The plan is close, but the remaining issues are architectural, not cosmetic: deployment topology, shared-state correctness, and execution-policy semantics are still loose enough to cause real failures during implementation.