docs: add Colab notebook for full-talk transcription (notebooks/colab_full_transcribe.ipynb)

GPU (T4) cells: ffmpeg + uv → anonymous clone → uv sync (engine + gpu) → detect → audio upload → full transcription with large-v3-turbo → download transcript.txt. (Colab cannot reach the internal gateway, so it is transcription-only; correction runs on-prem.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
chore(omc): hotpaths (beam-size/correct/COLAB)
2026-06-09 07:33:54 +09:00 · 2026-06-09 07:29:37 +09:00 · 2026-06-09 07:29:37 +09:00
5 changed files with 267 additions and 21 deletions
@@ -30,7 +30,7 @@
  },
  "build": {
    "buildCommand": null,
-    "testCommand": "export PATH=\"$HOME/.local/bin:$HOME/.cargo/bin:$PATH\"\necho \"=== ruff ===\"; uv run ruff check src/ tests/ && echo clean\necho \"=== pytest ===\"; uv run pytest -q 2>&1 | tail -6\necho \"=== 청크 분할 빠른 점검 ===\"; uv run python -c \"\nfrom luke_scribe.postprocess import llm\nt='. '.join(f'문장{i} EmbeddingGemma' for i in range(300))\nch=llm._chunk(t, 200)\nprint('total chars', len(t), '→ chunks', len(ch), '| max chunk', max(len(c) for c in ch))\nprint('all<=200:', all(len(c)<=200 for c in ch))\n\"",
+    "testCommand": "export PATH=\"$HOME/.local/bin:$HOME/.cargo/bin:$PATH\"\necho \"=== ruff ===\"; uv run ruff check src/ tests/ && echo clean\necho \"=== pytest ===\"; uv run pytest -q 2>&1 | tail -2\necho \"=== --correct 경로(설정 없음 → 우아한 에러) ===\"\nuv run luke-scribe transcribe /tmp/jfk.flac --model tiny --language en --correct 2>&1 | tail -4; echo \"exit=${PIPESTATUS[0]}\"",
    "lintCommand": "ruff check",
    "devCommand": null,
    "scripts": {}
@@ -111,30 +111,30 @@
    }
  },
  "hotPaths": [
+    {
+      "path": "src/luke_scribe/cli.py",
+      "accessCount": 8,
+      "lastAccessed": 1780957705972,
+      "type": "file"
+    },
+    {
+      "path": "src/luke_scribe/config.py",
+      "accessCount": 5,
+      "lastAccessed": 1780957473801,
+      "type": "file"
+    },
    {
      "path": "scripts/llm_correct.py",
      "accessCount": 4,
      "lastAccessed": 1780925584647,
      "type": "file"
    },
-    {
-      "path": "src/luke_scribe/cli.py",
-      "accessCount": 4,
-      "lastAccessed": 1780927984393,
-      "type": "file"
-    },
    {
      "path": "pyproject.toml",
      "accessCount": 4,
      "lastAccessed": 1780928043613,
      "type": "file"
    },
-    {
-      "path": "src/luke_scribe/config.py",
-      "accessCount": 4,
-      "lastAccessed": 1780956547899,
-      "type": "file"
-    },
    {
      "path": "README.md",
      "accessCount": 3,
@@ -338,6 +338,12 @@
      "accessCount": 1,
      "lastAccessed": 1780928028187,
      "type": "file"
+    },
+    {
+      "path": "COLAB.md",
+      "accessCount": 1,
+      "lastAccessed": 1780957731994,
+      "type": "file"
    }
  ],
  "userDirectives": [
@@ -0,0 +1,79 @@
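+# Colab / GPU full-transcription guide
+
+Transcribe a **full talk quickly** (with optional correction) in a GPU environment (Colab T4/A100 or an on-prem GPU).
+On a CPU (dev box) a full talk is slow (turbo at roughly RTF 5×) and not recommended; run it here instead.
+On a GPU (T4), turbo runs at roughly 0.1 to 0.3× real time, so **a 37-minute talk takes a few minutes**.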
+
+---
+
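+## A) Google Colab: transcription only
+
+> Colab is an external cloud, so it **cannot reach the internal LLM gateway (192.168.0.123)** → `--correct` (correction) is unavailable; **transcription only**.
+> Runtime → Change runtime type → select **GPU (T4)**.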
+
+```python
+# 1) System dependencies + uv
+!apt-get -qq update && apt-get -qq install -y ffmpeg
+!curl -LsSf https://astral.sh/uv/install.sh | sh
+import os; os.environ["PATH"] = "/root/.local/bin:" + os.environ["PATH"]
+
+# 2) Code (the repository allows anonymous read)
+!git clone -b feat/p1-core https://git.lukehemmin.com/lukehemmin/luke_scribe.git
+%cd luke_scribe
+
+# 3) Dependencies (engine + GPU CUDA runtime)
+!uv sync --extra engine --extra gpu
+
+# 4) Check GPU detection (tier T3 keeps turbo and large-v3 resident together)
+!uv run luke-scribe detect
+
+# 5) Upload audio (or mount Drive)
+from google.colab import files
+AUDIO = list(files.upload().keys())[0]
+
+# 6) Full transcription (large-v3-turbo); for higher accuracy use --model large-v3
+!uv run luke-scribe transcribe "$AUDIO" --model large-v3-turbo --language ko --timestamps | tee transcript.txt
+```
+
+### To expose Colab externally as an API
+```python
+# Get a public cloudflared URL → curl from outside
+!uv sync --extra engine --extra gpu --extra api
+import os
+os.environ["SCRIBE_API_KEYS"] = '["colab-test"]'
+!nohup uv run luke-scribe serve --host 0.0.0.0 --port 8000 --tunnel cloudflare > serve.log 2>&1 &
+import time; time.sleep(8); print(open("serve.log").read())   # check the public *.trycloudflare.com URL
+```
+
+---
+
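+## B) On-prem GPU: transcription + internal LLM correction (full pipeline)
+
+On a GPU machine inside the internal network (able to reach the gateway at 192.168.0.123), you can go **all the way to restoring transliterated terms to English** in one pass: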
+
+```bash
+git clone -b feat/p1-core https://git.lukehemmin.com/lukehemmin/luke_scribe.git && cd luke_scribe
+uv sync --extra engine --extra gpu
+
+export SCRIBE_LLM_BASE_URL=http://192.168.0.123:8080/v1
+export SCRIBE_LLM_API_KEY=<internal key>     # mind your shell history
+export SCRIBE_LLM_MODEL=copilot-gpt-4o
+export SCRIBE_LLM_MAX_CHARS=3000             # match the internal LLM's context window (~8k→1500 / ~16k→3000 / ~30k→6000)
+
+# Transcription + chunked correction in a single command
+uv run luke-scribe transcribe talk.m4a --model large-v3-turbo --language ko --correct | tee transcript.txt
+```
+
+Via the API:
+```bash
+uv run luke-scribe serve                     # use the X-API-Key it prints
+curl -H "X-API-Key: <key>" -F file=@talk.m4a -F model=large-v3-turbo -F correct=true \
+     http://localhost:8000/v1/transcribe
+```
+
+---
+
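+## Notes
+- Correction splits a long transcript into `SCRIBE_LLM_MAX_CHARS`-sized chunks and carries a **running glossary** across them (to cope with small context windows).
+- A weak GPU (1050/2GB) cannot fit even turbo, so it is automatically downgraded to **CPU (T0)**; check the tier with `detect`.
+- Audio files are not in the repository (`.gitignore`); use a Colab upload/Drive or a local on-prem path.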
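Reviewer sketch (not part of the diff): the notes above say correction splits a long transcript into `SCRIBE_LLM_MAX_CHARS`-sized chunks. The snippet below is one way such a splitter could look. It is not the repository's `llm._chunk`; the name `chunk_text`, the sentence-boundary regex, and the hard-split fallback for over-long sentences are all assumptions made for illustration, and the running glossary is not modelled here.

```python
import re


def chunk_text(text: str, max_chars: int) -> list[str]:
    """Split text into chunks of at most max_chars, preferring sentence boundaries.

    Hypothetical helper for illustration only.
    """
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")

    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())

    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        if not sentence:
            continue
        # A single sentence longer than the limit is cut at fixed offsets.
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate      # still fits: keep growing this chunk
        else:
            chunks.append(current)   # flush and start a new chunk
            current = sentence
    if current:
        chunks.append(current)
    return chunks


if __name__ == "__main__":
    sample = ". ".join(f"sentence{i} EmbeddingGemma" for i in range(300))
    parts = chunk_text(sample, 200)
    print("chunks:", len(parts), "| max chunk:", max(len(p) for p in parts))
```

Each chunk would then be sent to the LLM separately, which is why the limit has to be sized to the model's context window.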
@@ -0,0 +1,130 @@
+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
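+    "# luke_scribe: full-talk transcription on Colab\n",
+    "\n",
+    "Transcribes a full talk in **a few minutes** on a GPU (T4).\n",
+    "\n",
+    "**First:** Runtime → Change runtime type → select **GPU** as the hardware accelerator.\n",
+    "\n",
+    "> ⚠️ Colab is external, so it **cannot reach the internal LLM gateway (192.168.0.123)** → correction (`--correct`) is unavailable; **transcription only**. For correction, use a GPU on the internal network (repo `COLAB.md`, section B).\n"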
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "# 0) Check for a GPU (if none, switch the runtime type to GPU)\n",
+    "!nvidia-smi -L || echo \"No GPU → switch the runtime type to GPU\"\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "# 1) System dependencies + uv\n",
+    "!apt-get -qq update && apt-get -qq install -y ffmpeg\n",
+    "!curl -LsSf https://astral.sh/uv/install.sh | sh\n",
+    "import os\n",
+    "os.environ['PATH'] = '/root/.local/bin:' + os.environ['PATH']\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "# 2) Fetch the code (the repository allows anonymous read)\n",
+    "!git clone -b feat/p1-core https://git.lukehemmin.com/lukehemmin/luke_scribe.git\n",
+    "%cd luke_scribe\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "# 3) Dependencies (engine + GPU CUDA runtime); takes a few minutes\n",
+    "!uv sync --extra engine --extra gpu\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "# 4) Check the hardware tier (T3 = turbo and large-v3 resident together)\n",
+    "!uv run luke-scribe detect\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "# 5) Upload the talk audio (m4a/mp3/wav/mp4 …)\n",
+    "from google.colab import files\n",
+    "AUDIO = list(files.upload().keys())[0]\n",
+    "print('Uploaded:', AUDIO)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "# 6) Full transcription (large-v3-turbo; for higher accuracy use --model large-v3)\n",
+    "!uv run luke-scribe transcribe \"$AUDIO\" --model large-v3-turbo --language ko --timestamps | tee transcript.txt\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "metadata": {},
+   "execution_count": null,
+   "outputs": [],
+   "source": [
+    "# 7) Download the transcript\n",
+    "from google.colab import files\n",
+    "files.download('transcript.txt')\n"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
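+    "## Notes\n",
+    "- **Model**: `large-v3-turbo` (fast) ↔ `large-v3` (accurate). If `detect` reports T0 (CPU), the GPU is too weak (slow).\n",
+    "- **Correction (transliteration → English)**: not possible on Colab (gateway unreachable). Use `--correct` + `SCRIBE_LLM_*` on a GPU inside the internal network (`COLAB.md`, section B).\n",
+    "- **Speed**: T4 turbo ≈ 0.1 to 0.3× real time → a 37-minute talk takes a few minutes.\n"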
+   ]
+  }
+ ],
+ "metadata": {
+  "accelerator": "GPU",
+  "colab": {
+   "provenance": [],
+   "gpuType": "T4"
+  },
+  "kernelspec": {
+   "display_name": "Python 3",
+   "name": "python3"
+  },
+  "language_info": {
+   "name": "python"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 0
+}
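Reviewer sketch (not part of the diff): the speed note in the notebook's closing cell can be checked with simple arithmetic, treating the quoted figures as real-time factors (processing time divided by audio duration). The helper name `estimate_minutes` is hypothetical, and the factors are the rough values quoted in the guide, not measurements.

```python
def estimate_minutes(audio_minutes: float, rtf: float) -> float:
    """Estimated processing time = audio duration x real-time factor (RTF)."""
    return audio_minutes * rtf


if __name__ == "__main__":
    talk = 37.0  # length of the talk in minutes, as in the guide

    # T4 + turbo: roughly 0.1 to 0.3x real time (figures from the guide).
    fast = estimate_minutes(talk, 0.1)
    slow = estimate_minutes(talk, 0.3)
    print(f"T4 turbo: {fast:.1f}-{slow:.1f} min")   # T4 turbo: 3.7-11.1 min

    # CPU + turbo: roughly RTF 5x (figure from the guide).
    cpu = estimate_minutes(talk, 5.0)
    print(f"CPU turbo: {cpu:.0f} min")              # CPU turbo: 185 min
```

So "a few minutes" on a T4 works out to about 4 to 11 minutes, against roughly three hours on the CPU dev box.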
@@ -55,6 +55,8 @@ def transcribe(
    device: str = typer.Option("auto", help="auto|cpu|cuda"),
    word_timestamps: bool = typer.Option(False, "--word-timestamps"),
    vad: bool = typer.Option(True, "--vad/--no-vad", help="Remove silence"),
+    beam_size: int = typer.Option(None, "--beam-size", help="Decoding beam (1-2 recommended on CPU for speed)"),
+    correct: bool = typer.Option(False, "--correct", help="Internal LLM correction (requires SCRIBE_LLM_* settings)"),
    timestamps: bool = typer.Option(False, "--timestamps", help="Show segment [start–end]"),
 ) -> None:
    """One-shot file transcription (faster-whisper, automatic CPU/GPU, part of AC-4)."""
@@ -90,17 +92,45 @@ def transcribe(
    )

    engine = FasterWhisperEngine(model_name, dev, profile.compute_type, cache_dir=settings.model_cache_dir)
-    segments, tinfo = engine.transcribe(file, language=lang, word_timestamps=word_timestamps, vad=vad)
+    segments, tinfo = engine.transcribe(
+        file, language=lang, word_timestamps=word_timestamps, vad=vad,
+        beam_size=(beam_size or settings.beam_size),
+    )

-    count = 0
+    seg_list = []
    for seg in segments:
-        count += 1
-        if timestamps:
-            console.print(f"[cyan][{seg.start:6.2f}–{seg.end:6.2f}][/] {seg.text.strip()}")
-        else:
-            console.print(seg.text.strip())
+        seg_list.append({"start": seg.start, "end": seg.end, "text": seg.text.strip()})
+        if not correct:  # stream output (with correction, collect everything and print once)
+            if timestamps:
+                console.print(f"[cyan][{seg.start:6.2f}–{seg.end:6.2f}][/] {seg.text.strip()}")
+            else:
+                console.print(seg.text.strip())
+
+    if correct:
+        from .postprocess import llm as llm_correct
+        from .postprocess import rules
+
+        text = " ".join(s["text"] for s in seg_list).strip()
+        try:
+            text = rules.normalize(
+                llm_correct.correct(
+                    text,
+                    base_url=settings.llm_base_url,
+                    api_key=settings.llm_api_key,
+                    model=settings.llm_model,
+                    max_chars=settings.llm_max_chars,
+                )
+            )
+        except llm_correct.LLMNotConfigured as exc:
+            console.print(f"[red]--correct:[/] {exc}")
+            raise typer.Exit(code=1) from exc
+        console.print(text)
+
    detected = getattr(tinfo, "language", None)
-    console.print(f"[green]✓ {count} segments · detected_lang={detected} · model_used={model_name}[/]")
+    console.print(
+        f"[green]✓ {len(seg_list)} segments · detected_lang={detected} · "
+        f"model_used={model_name} · corrected={correct}[/]"
+    )


@app.command()
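Reviewer sketch (not part of the diff): the hunk above resolves the beam size with `beam_size or settings.beam_size`. The snippet below isolates that fallback so its edge case is visible. The `Settings` dataclass is a stand-in for the project's real settings class, and `resolve_beam_size_strict` is a hypothetical alternative, not something the commit contains.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Settings:
    # Stand-in for the project's Settings; only the field used here.
    beam_size: int = 5


def resolve_beam_size(cli_value: Optional[int], settings: Settings) -> int:
    """Same expression as the diff: any falsy CLI value falls back to settings."""
    return cli_value or settings.beam_size


def resolve_beam_size_strict(cli_value: Optional[int], settings: Settings) -> int:
    """Hypothetical variant: fall back only when the flag was omitted (None)."""
    return settings.beam_size if cli_value is None else cli_value


if __name__ == "__main__":
    s = Settings()
    print(resolve_beam_size(None, s))         # flag omitted → configured default
    print(resolve_beam_size(2, s))            # explicit --beam-size 2 wins
    print(resolve_beam_size(0, s))            # 0 is falsy, so the default is used
    print(resolve_beam_size_strict(0, s))     # strict variant passes 0 through
```

Since a beam size of 0 is not a meaningful request, the `or` form is probably fine in practice; the strict form would only matter if the CLI should reject or surface such a value instead of silently replacing it.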
@@ -15,6 +15,7 @@ class Settings(BaseSettings):
    device: str = "auto"
    compute_type: str | None = None      # None = automatic (based on cc/VRAM)
    workers: int | None = None           # None = computed automatically
+    beam_size: int = 5                    # decoding beam (1-2 recommended on CPU for speed; 5 on GPU)

    # Language (default ko, per-request override)
    language: str = "ko"