Transcribing Video Speech to Text with Whisper

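OpenAI Whisper is an open-source speech recognition model that is freely available on GitHub. This article shows how to use Python and FastAPI to transcribe the audio of a YouTube video into text and to serve the result through a small web service.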

1. Install dependencies

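First, list the required Python packages in a requirements.txt file. python-multipart is included because FastAPI needs it to parse the form data that the /api endpoint receives later on. Whisper also relies on the ffmpeg command-line tool being installed on the system.

text

openai-whisper
pytube
fastapi
uvicorn
python-multipart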

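Note: PyPI also hosts a third-party package named whisper, which is not OpenAI's Whisper. To avoid confusion, install openai-whisper, or install directly from GitHub by putting this line in requirements.txt instead:
text
git+https://github.com/openai/whisper.git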
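To check which of the two similarly named packages ended up in the environment, the installed distributions can be queried by name. The following is a minimal sketch using only the standard library; the helper name installed_version is hypothetical and not part of the original project.

```python
from importlib import metadata
from typing import Optional


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is absent."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    # "openai-whisper" is the distribution this article relies on;
    # "whisper" is the unrelated PyPI package mentioned in the note.
    for name in ("openai-whisper", "whisper"):
        print(f"{name}: {installed_version(name) or 'not installed'}")
```

If the second line reports a version while the first does not, the wrong package was most likely installed.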

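1.1 Virtual environment (optional)

Using virtualenv to create an isolated Python environment is recommended, which helps ensure that the correct packages are installed: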

bash

# Install virtualenv
pip install virtualenv

# Create a virtual environment named youtext
virtualenv youtext

# Activate the environment on Unix-like systems
source youtext/bin/activate

# Install the dependencies
pip install -r requirements.txt

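2. Download the YouTube video

Create a download.py file that downloads the YouTube video. Because video titles can contain special characters that are awkward in file names, the file name is derived from an MD5 hash of the title.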

python

import hashlib
from pathlib import Path
from pytube import YouTube


def download_video(url: str) -> dict:
    """Download a YouTube video and return its file name and title."""
    yt = YouTube(url)
    hash_file = hashlib.md5()
    hash_file.update(yt.title.encode("utf-8"))
    file_name = f"{hash_file.hexdigest()}.mp4"

    # Prefer a progressive stream, which contains both audio and video
    stream = yt.streams.filter(progressive=True).first()
    if stream is None:
        stream = yt.streams.first()
    stream.download(output_path=".", filename=file_name)

    return {
        "file_name": file_name,
        "title": yt.title,
    }
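
The file-naming step can be tried on its own, without downloading anything. This is a small sketch of the same idea; the helper name safe_file_name and the sample title are made up for illustration.

```python
import hashlib


def safe_file_name(title: str, extension: str = ".mp4") -> str:
    """Derive a file name from a title using the hex digest of its MD5 hash."""
    # The digest contains only the characters 0-9 and a-f, so slashes,
    # quotes, or emoji in the title never reach the file system.
    digest = hashlib.md5(title.encode("utf-8")).hexdigest()
    return f"{digest}{extension}"


if __name__ == "__main__":
    print(safe_file_name('What/Is: "Whisper"?'))
```

The hash is used here only to obtain a safe, deterministic name, not for security. Note that two videos with identical titles map to the same file name.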

3. Transcribe the video

Create a transcribe.py file that uses Whisper to transcribe the downloaded video.

python

import os
import whisper
from download import download_video


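# Load the Whisper model.
# Available models include tiny, base, small, medium, and large.
# For English-language videos, an English-only model such as base.en can be used.
model_name = "base.en"
model = whisper.load_model(model_name)


def format_item(item: dict) -> dict:
    """Format a Whisper segment as a timestamp and its text."""
    return {
        "time": item["start"],
        "text": item["text"],
    }


def transcribe(url: str) -> dict:
    """Download the video, transcribe it, and clean up the local video file."""
    video = download_video(url)
    try:
        result = model.transcribe(video["file_name"])
    finally:
        # Remove the downloaded file even if transcription fails
        os.remove(video["file_name"])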

    segments = [format_item(item) for item in result["segments"]]

    return {
        "title": video["title"],
        "segments": segments,
    }
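
The choice between multilingual and English-only model names can be captured in a tiny helper. This is only a sketch: pick_model_name is a hypothetical function, and it assumes that the tiny, base, small, and medium sizes have ".en" variants while large does not.

```python
# Sizes assumed to ship an English-only ".en" variant.
ENGLISH_ONLY_SIZES = {"tiny", "base", "small", "medium"}


def pick_model_name(size: str, english_only: bool) -> str:
    """Return the model name to pass to whisper.load_model()."""
    if english_only and size in ENGLISH_ONLY_SIZES:
        return f"{size}.en"
    # Fall back to the multilingual model of the requested size.
    return size


if __name__ == "__main__":
    print(pick_model_name("base", english_only=True))
    print(pick_model_name("large", english_only=True))
```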

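The result returned by model.transcribe() contains several fields; this article only uses start (the segment's start time in seconds) and text (the transcribed text). These two fields are later used to build links that jump to the corresponding moment in the YouTube video.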
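
As an illustration of how a start time can become a jump link, here is a minimal sketch that adds a t parameter to the video URL. The helper name build_timestamp_link, the sample URL, and the sample start time are assumptions for this example, not part of the original code.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def build_timestamp_link(video_url: str, start_seconds: float) -> str:
    """Return video_url with a t=<seconds>s query parameter added."""
    parts = urlsplit(video_url)
    # Drop any existing "t" parameter so the link carries a single timestamp.
    query = [(key, value) for key, value in parse_qsl(parts.query) if key != "t"]
    # Segment start times are fractional; truncate to whole seconds.
    query.append(("t", f"{int(start_seconds)}s"))
    return urlunsplit(parts._replace(query=urlencode(query)))


if __name__ == "__main__":
    print(build_timestamp_link("https://www.youtube.com/watch?v=abc123", 12.7))
```

Parsing the URL instead of appending "&t=..." by hand also handles URLs that have no query string yet.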

4. Serve it with FastAPI

Create an app.py file that uses FastAPI to expose two endpoints: one serves the front-end page, and the other accepts a YouTube URL and returns the transcription.

python

from fastapi import FastAPI, Form
from fastapi.responses import HTMLResponse
from fastapi.staticfiles import StaticFiles
from transcribe import transcribe


app = FastAPI()
app.mount("/static", StaticFiles(directory="static"), name="static")


@app.get("/", response_class=HTMLResponse)
def index():
    """Return the static home page HTML."""
    with open("static/index.html", "r", encoding="utf-8") as f:
        return f.read()


@app.post("/api")
def api(url: str = Form(...)):
    """Accept a YouTube URL and return the transcription."""
    data = transcribe(url)
    return {
        "url": url,
        "data": data,
    }

The two routes work as follows:

GET /: returns the contents of static/index.html.
POST /api: reads the url parameter from the form data, calls transcribe(url), and returns the result as JSON.
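
The POST /api route can also be called without the browser. Below is a sketch of a standard-library client; the function names build_api_request and fetch_transcript are hypothetical, and the base URL assumes the server is running locally on uvicorn's default port.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen


def build_api_request(base_url: str, video_url: str) -> Request:
    """Build a form-encoded POST request for the /api endpoint."""
    body = urlencode({"url": video_url}).encode("ascii")
    headers = {"Content-Type": "application/x-www-form-urlencoded"}
    return Request(f"{base_url}/api", data=body, headers=headers, method="POST")


def fetch_transcript(base_url: str, video_url: str) -> dict:
    """Send the request and decode the JSON response (needs a running server)."""
    with urlopen(build_api_request(base_url, video_url)) as response:
        return json.load(response)


if __name__ == "__main__":
    request = build_api_request(
        "http://127.0.0.1:8000", "https://www.youtube.com/watch?v=abc123"
    )
    print(request.get_method(), request.full_url)
    print(request.data)
```

Transcription happens inside the request, so a call to fetch_transcript may take a while for long videos.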

5. HTML user interface

In static/index.html, load Tailwind CSS and jQuery from a CDN and build a simple user interface.

html

<!DOCTYPE html>
<html lang="en">
<head>
    <meta charset="UTF-8">
    <meta name="viewport" content="width=device-width, initial-scale=1.0">
    <title>YouText</title>
    <script src="https://cdn.tailwindcss.com"></script>
    <script src="https://cdnjs.cloudflare.com/ajax/libs/jquery/3.6.3/jquery.min.js"
        integrity="sha512-STof4xm1wgkfm7heWqFJVn58Hm3EtS31XFaagaa8VMReCXAkQnJZ+jEy8PCC/iT18dFy95WcExNHFTqLyp72eQ=="
        crossorigin="anonymous" referrerpolicy="no-referrer"></script>
</head>
<body>
    <div class="text-5xl font-extrabold max-w-3xl mx-auto p-12 text-center">
        <span class="bg-clip-text text-transparent bg-gradient-to-r from-pink-500 to-violet-500">
            YouText
        </span>
        <p class="text-2xl font-light text-gray-600">Convert YouTube video to Text.</p>
    </div>

    <form id="form" class="max-w-3xl mx-auto space-y-4 p-8">
        <input name="url" class="rounded-sm p-2 w-full border" placeholder="Type youtube url here..." />
        <button type="submit" class="text-white bg-violet-500 rounded-sm w-full py-2">Submit</button>
    </form>

    <div class="max-w-3xl mx-auto space-y-4 p-8 bg-gray-200 relative">
        <h2 id="title" class="font-semibold text-2xl"></h2>
        <div class="absolute top-2 right-2 hover:cursor-pointer" id="copy-text">
            <!-- SVG icon, see https://gist.github.com/ahmadrosid/73b006f9265a262ace151bbce3a2d7fb -->
        </div>
        <div id="result"></div>
    </div>

    <script>
        jQuery(document).ready(() => {
            jQuery("#form").submit(function (e) {
                e.preventDefault();

                const formData = jQuery(this).serialize();
                const req = jQuery.post("/api", formData, (data) => {
                    let sentenceBuffer = "";

                    // Clear the results of any previous submission
                    jQuery("#result").empty();

                    data.data.segments.forEach(item => {
                        // Start times are fractional seconds; the link uses whole seconds
                        const time = Math.floor(item.time);
                        sentenceBuffer += `<a class="hover:text-violet-600" href="${data.url}&t=${time}s">${item.text}</a>`;

                        if (item.text.includes(".")) {
                            jQuery("#result").append(`<p class="pb-2">${sentenceBuffer}</p>`);
                            sentenceBuffer = "";
                        }
                    });

                    if (sentenceBuffer !== "") {
                        jQuery("#result").append(`<p>${sentenceBuffer}</p>`);
                    }

                    jQuery("#title").text(data.data.title);

                    // Rebind the handler so repeated submissions do not stack click listeners
                    jQuery("#copy-text").off("click").on("click", () => {
                        const inputElement = jQuery("<textarea>");
                        jQuery("body").append(inputElement);
                        const texts = data.data.segments.map(item => item.text).join("").trim();
                        inputElement.val(texts).select();
                        document.execCommand("copy");
                        inputElement.remove();
                    });
                });

                req.fail((err) => {
                    console.error(err);
                });
            });
        });
    </script>
</body>
</html>

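The page contains a title, an input box, a submit button, a container for the results, and a button for copying the text. Clicking any text fragment in the transcript jumps to the corresponding moment in the YouTube video.

YouText user interface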
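
The page script groups segments into paragraphs by buffering them until one contains a period. The same rule can be sketched in Python, for example to reuse it on the server side; group_into_paragraphs and the sample segments are hypothetical.

```python
from typing import Dict, List


def group_into_paragraphs(segments: List[Dict[str, str]]) -> List[str]:
    """Join segment texts, starting a new paragraph after each segment with a period."""
    paragraphs: List[str] = []
    buffer = ""
    for segment in segments:
        buffer += segment["text"]
        if "." in segment["text"]:
            # Unlike the page script, trim surrounding whitespace here.
            paragraphs.append(buffer.strip())
            buffer = ""
    # Keep any trailing text that never reached a period.
    if buffer:
        paragraphs.append(buffer.strip())
    return paragraphs


if __name__ == "__main__":
    sample = [
        {"text": " Hello and"},
        {"text": " welcome."},
        {"text": " Next topic"},
    ]
    print(group_into_paragraphs(sample))
```

This is a simple heuristic: a period inside an abbreviation or a number also ends a paragraph.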

6. Running the app and conclusion

Run the following from the project root:

bash

uvicorn app:app --reload

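Then open http://127.0.0.1:8000 in a browser to use the app.

This article showed how to transcribe the speech in a YouTube video to text with OpenAI Whisper and how to serve the result with FastAPI. The complete source code is available in the YouText repository.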

References

https://ahmadrosid.com/blog/youtube-transcriptioin-with-openai-whisper
