解码策略

解码策略决定了模型如何选择下一个生成的token。存在许多类型的解码策略，选择合适的策略对生成文本的质量有着显著影响。

本文将帮助您了解Transformers中可用的不同解码策略，以及如何和何时使用它们。

基础解码方法

这些是经过验证的解码方法，应作为文本生成任务的起点。

贪婪搜索（Greedy Search）

贪婪搜索是默认的解码策略。它在每一步选择最有可能的下一个token。除非在GenerationConfig 中指定，否则此策略最多生成20个新token。

贪婪搜索适用于输出较短且不需要创造力的任务。然而，在生成较长序列时，它很容易陷入重复。

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16).to("cuda")
# 显式设置为默认长度，因为Llama2的生成长度为4096
outputs = model.generate(**inputs, max_new_tokens=20)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

输出：

python

Hugging Face is an open-source company that provides a suite of tools and services for building, deploying, and maintaining natural language processing

采样（Sampling）

采样（或多项式采样）根据整个模型词汇表上的概率分布随机选择token（而不是像贪婪搜索那样选择最有可能的token）。这意味着任何具有非零概率的token都有机会被选中。采样策略可以减少重复，并生成更具创造性和多样性的输出。

通过设置do_sample=True和num_beams=1启用多项式采样。

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16).to("cuda")
# 显式设置为100，因为Llama2的生成长度为4096
outputs = model.generate(**inputs, max_new_tokens=50, do_sample=True, num_beams=1)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

输出：

Hugging Face is an open-source company 🤗
We are open-source and believe that open-source is the best way to build technology. Our mission is to make AI accessible to everyone, and we believe that open-source is the best way to achieve that.

束搜索（Beam Search）

束搜索在每个时间步跟踪多个生成序列（束）。经过一定步数后，它选择整体概率最高的序列。与贪婪搜索不同，这种策略可以“向前看”，即使初始token的概率较低，也可以选择整体概率更高的序列。它最适合输入导向的任务，例如描述图像或语音识别。您也可以在束搜索中使用do_sample=True在每一步进行采样，但束搜索仍会在各步之间贪婪地剔除低概率序列。

通过设置num_beams参数（应大于1，否则等同于贪婪搜索）启用束搜索。

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16).to("cuda")
# 显式设置为100，因为Llama2的生成长度为4096
outputs = model.generate(**inputs, max_new_tokens=50, num_beams=2)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

输出：

python

['Hugging Face is an open-source company that develops and maintains the Hugging Face platform, which is a collection of tools and libraries for building and deploying natural language processing (NLP) models. Hugging Face was founded in 2018 by Thomas Wolf']

高级解码方法

高级解码方法旨在解决特定的生成质量问题（例如重复）或在某些情况下提高生成吞吐量。这些技术较为复杂，可能并非适用于所有模型。

辅助解码（Speculative Decoding）

辅助解码并不是一种搜索或采样策略。相反，辅助解码通过添加一个较小的辅助模型来生成候选token。主模型在一次前向传递中验证候选token，从而加速整个解码过程。这种方法特别适用于大型语言模型（LLM），因为生成token的成本较高且速度较慢。

目前，辅助解码仅支持贪婪搜索和多项式采样，且不支持批量输入。

通过assistant_model参数启用辅助解码。当辅助模型远小于主模型时，您会发现速度提升最为显著。添加do_sample=True以启用带重采样的token验证。

python

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-1.7B")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-1.7B")
assistant_model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt")

outputs = model.generate(**inputs, assistant_model=assistant_model)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

输出：

python

Hugging Face is an open-source company that provides a platform for developers to build and deploy machine

辅助解码也支持通过assistant_model参数在Pipeline中使用。

python

from transformers import pipeline
import torch

pipe = pipeline(
    "text-generation",
    model="meta-llama/Llama-3.1-8B",
    assistant_model="meta-llama/Llama-3.2-1B",
    torch_dtype=torch.bfloat16
)
pipe_output = pipe("Once upon a time, ", max_new_tokens=50, do_sample=False)
pipe_output[0]["generated_text"]

提示查找解码（Prompt Lookup Decoding）

提示查找解码是辅助解码的一种变体，它使用重叠的n-gram作为候选token。它适用于输入导向的任务，例如摘要。

通过prompt_lookup_num_tokens参数启用提示查找解码。

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-1.7B")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-1.7B", torch_dtype=torch.float16).to("cuda")
assistant_model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-135M", torch_dtype=torch.float16).to("cuda")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, assistant_model=assistant_model, max_new_tokens=20, prompt_lookup_num_tokens=5)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

输出：

python

Hugging Face is an open-source company that provides a platform for developers to build and deploy machine learning models. It offers a variety of tools

自我辅助解码（Self-speculative Decoding）

使用language modeling heads 的早期隐藏状态作为输入，有效地跳过一些层以生成质量较低的输出。该低质量输出用作辅助输出，然后通过剩余层应用辅助解码来修正输出。这种自我辅助解码方法生成的最终结果与原始模型生成的结果相同（或具有相同的分布）。

辅助模型也是目标模型的一部分，因此可以共享缓存和权重，从而降低内存需求。

对于经过早期退出训练的模型，通过generate()中的assistant_early_exit参数传递。

python

from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Alice and Bob"
checkpoint = "facebook/layerskip-llama3.2-1B"

tokenizer = AutoTokenizer.from_pretrained(checkpoint)
inputs = tokenizer(prompt, return_tensors="pt")

model = AutoModelForCausalLM.from_pretrained(checkpoint)
outputs = model.generate(**inputs, assistant_early_exit=4, do_sample=False, max_new_tokens=20)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

普适辅助解码（Universal Assisted Decoding）

普适辅助解码（UAD）允许主模型和辅助模型使用不同的分词器。主模型的输入token被重新编码为辅助模型的token。候选token在辅助编码中生成，然后重新编码为主模型的候选token。候选token的验证方式与辅助解码中所述相同。

重新编码涉及将token ID解码为文本，然后使用不同的分词器对文本进行编码。为了避免在重新编码过程中出现分词差异，UAD会找到源编码和目标编码之间的最长公共子序列，以确保新token包含正确的提示后缀。

通过在generate()中添加tokenizer和assistant_tokenizer参数启用UAD。

python

from transformers import AutoModelForCausalLM, AutoTokenizer

prompt = "Alice and Bob"

assistant_tokenizer = AutoTokenizer.from_pretrained("double7/vicuna-68m")
tokenizer = AutoTokenizer.from_pretrained("google/gemma-2-9b")
inputs = tokenizer(prompt, return_tensors="pt")

model = AutoModelForCausalLM.from_pretrained("google/gemma-2-9b")
assistant_model = AutoModelForCausalLM.from_pretrained("double7/vicuna-68m")
outputs = model.generate(**inputs, assistant_model=assistant_model, tokenizer=tokenizer, assistant_tokenizer=assistant_tokenizer)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

输出：

['Alice and Bob are sitting in a bar. Alice is drinking a beer and Bob is drinking a']

对比搜索（Contrastive Search）

对比搜索是一种旨在减少重复的解码策略，即使在生成较长序列时也是如此。该策略比较生成token与之前token的相似度，如果它们更相似，则会应用惩罚。

通过penalty_alpha和top_k参数启用对比搜索。penalty_alpha管理应用的惩罚，top_k是返回的最有可能的token数量。

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16).to("cuda")
# 显式设置为100，因为Llama2的生成长度为4096
outputs = model.generate(**inputs, max_new_tokens=100, penalty_alpha=0.6, top_k=4)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

输出：

python

Hugging Face is an open-source company that provides a platform for building and deploying AI models.
Hugging Face is an open-source company that provides a platform for building and deploying AI models. The platform allows developers to build and deploy AI models, as well as collaborate with other developers.
Hugging Face was founded in 2019 by Thibault Wittemberg and Clément Delangue. The company is based in Paris, France.
Hugging Face has

DoLa（Decoding by Contrasting Layers）

DoLa是一种对比解码策略，旨在提高事实性并减少幻觉。该策略通过对比最终层与早期层之间的logit差异来实现。因此，特定层中的事实性知识得以增强。不建议将DoLa用于较小的模型，例如GPT-2。

通过以下参数启用DoLa：

dola_layers：这些是与最终层进行对比的候选层。它可以是一个字符串（low或high），用于对比层的较低或较高部分。对于短答案任务（如TruthfulQA），建议使用high；对于长答案推理任务（如GSM8K、StrategyQA、FACTOR和VicunaQA），建议使用low。
如果模型的词嵌入是绑定的，则跳过第0层，从第2层开始。
它也可以是一个整数列表，表示0到总层数之间的层索引。第0层是词嵌入，第1层是第一个Transformer层，依此类推。根据模型层数的不同，层索引范围如下表所示：

层数	`low`	`high`
40	(0, 20, 2)	(N - 20, N, 2)
<= 40	range(0, N // 2, 2)	range(N // 2, N, 2)

repetition_penalty：减少重复，建议设置为1.2。

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("HuggingFaceTB/SmolLM-1.7B")
model = AutoModelForCausalLM.from_pretrained("HuggingFaceTB/SmolLM-1.7B", torch_dtype=torch.float16).to("cuda")
inputs = tokenizer("What is the highest peak in the world??", return_tensors="pt").to("cuda")

outputs = model.generate(**inputs, max_new_tokens=50, dola_layers="high", do_sample=False)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

输出：

Mount EverestMount Everest, called Himalaya in Nepali, is the world's highest peak, lying almost 9.5 kilometers above the sea level and the tallest mountain from 19,036.91 ft. The mountain was

多样性束搜索（Diverse Beam Search）

多样性束搜索是束搜索的一种变体，旨在生成更具多样性的输出候选结果。该策略衡量序列之间的差异性，如果序列过于相似，则会应用惩罚。为了避免高昂的计算成本，束被分成若干组。

通过num_beams、num_beam_groups和diversity_penalty参数启用多样性束搜索（num_beams应能被num_beam_groups整除）。

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("meta-llama/Llama-2-7b-hf")
inputs = tokenizer("Hugging Face is an open-source company", return_tensors="pt").to("cuda")

model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf", torch_dtype=torch.float16).to("cuda")
# 显式设置为100，因为Llama2的生成长度为4096
outputs = model.generate(**inputs, max_new_tokens=50, num_beams=6, num_beam_groups=3, diversity_penalty=1.0, do_sample=False)
tokenizer.batch_decode(outputs, skip_special_tokens=True)

输出：

Hugging Face is an open-source company 🤗
We are an open-source company. Our mission is to democratize AI and make it accessible to everyone. We believe that AI should be used for the benefit of humanity, not for the benefit of a

自定义解码方法

自定义解码方法可以实现特定的生成行为，例如：

如果模型不确定，则继续思考；
如果模型陷入困境，则回滚生成；
使用自定义逻辑处理特殊token；
为高级模型增强输入准备。

我们通过模型仓库启用自定义解码方法，假设其具有特定的模型标签和文件结构（见下文子节）。此功能是自定义建模代码的扩展，同样需要设置trust_remote_code=True。

如果模型仓库包含自定义解码方法，尝试它的最简单方法是加载模型并使用它进行生成：

python

from transformers import AutoModelForCausalLM, AutoTokenizer

# `transformers-community/custom_generate_example`是`Qwen/Qwen2.5-0.5B-Instruct`的一个副本，
# 但它带有自定义生成代码 -> 调用`generate`将使用自定义解码方法！
tokenizer = AutoTokenizer.from_pretrained("transformers-community/custom_generate_example")
model = AutoModelForCausalLM.from_pretrained(
    "transformers-community/custom_generate_example", device_map="auto", trust_remote_code=True
)

inputs = tokenizer(["The quick brown"], return_tensors="pt").to(model.device)
# 自定义解码方法是一个最小化的贪婪解码实现。它还会在运行时打印一条自定义消息。
gen_out = model.generate(**inputs)
# 现在您应该会看到它的自定义消息："✨ using a custom generation method ✨"
print(tokenizer.batch_decode(gen_out, skip_special_tokens=True))

输出：

The quick brown fox jumps over a lazy dog, and the dog is a type of animal. Is

带有自定义解码方法的模型仓库有一个特殊属性：其解码方法可以通过generate()的custom_generate参数从任何模型中加载。这意味着任何人都可以创建并共享其自定义生成方法，使其能够与任何Transformers模型一起使用，而无需用户安装额外的Python包。

python

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct")
model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen2.5-0.5B-Instruct", device_map="auto")

inputs = tokenizer(["The quick brown"], return_tensors="pt").to(model.device)
# `custom_generate`用自定义解码方法替换了原始的`generate`
gen_out = model.generate(**inputs, custom_generate="transformers-community/custom_generate_example", trust_remote_code=True)
print(tokenizer.batch_decode(gen_out, skip_special_tokens=True)[0])

输出：

The quick brown fox jumps over a lazy dog, and the dog is a type of animal. Is

您应该阅读包含自定义生成策略的仓库的README.md文件，以了解是否存在新的参数和输出类型差异。否则，您可以假设它的工作方式与基础generate()方法相同。

您可以通过搜索其自定义标签（custom_generate）来找到所有自定义解码方法。

以transformers-community/custom_generate_example仓库为例。其README.md指出，它有一个额外的输入参数left_padding，用于在提示之前添加一定数量的填充token。

python

gen_out = model.generate(
    **inputs, custom_generate="transformers-community/custom_generate_example", trust_remote_code=True, left_padding=5
)
print(tokenizer.batch_decode(gen_out)[0])

输出：

python

<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>The quick brown fox jumps over the lazy dog.

The sentence "The quick"

如果自定义方法的Python依赖项与您的环境不匹配，您将收到有关缺少依赖项的异常。例如，transformers-community/custom_generate_bad_requirements在其custom_generate/requirements.txt文件中定义了一组不可能满足的依赖项，如果您尝试运行它，您将看到以下错误消息。

python

ImportError: Missing requirements in your local environment for `transformers-community/custom_generate_bad_requirements`:
foo (installed: None)
bar==0.0.0 (installed: None)
torch>=99.0 (installed: 2.6.0)

根据提示更新您的Python依赖项将消除此错误消息。

创建自定义解码方法

要创建新的解码方法，您需要创建一个新的模型仓库，并将一些文件推送到其中。

您设计解码方法所使用的模型。
custom_generate/generate.py，其中包含您自定义解码方法的所有逻辑。
custom_generate/requirements.txt，用于可选地添加新的Python依赖项和/或锁定特定版本，以便正确使用您的方法。
README.md，您应在此处添加custom_generate标签，并记录您自定义方法的任何新参数或输出类型差异。

添加所有必需的文件后，您的仓库应如下所示：

plaintext

your_repo/
├── README.md          # 包含`custom_generate`标签
├── config.json
├── ...
└── custom_generate/
    ├── generate.py
    └── requirements.txt

添加基础模型

您自定义解码方法的起点是一个普通的模型仓库。要添加到此仓库的模型应该是您设计方法时所使用的模型，并且它应该是工作中的一个完整的自包含模型-生成对的一部分。当加载此仓库中的模型时，您的自定义解码方法将覆盖generate。不用担心——您的解码方法仍然可以与任何其他Transformers模型一起加载，如上节所述。

如果您只是想复制一个现有的模型，可以执行以下操作：

python

from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("source/model_repo")
model = AutoModelForCausalLM.from_pretrained("source/model_repo")
tokenizer.save_pretrained("your/decoding_method", push_to_hub=True)
model.save_pretrained("your/decoding_method", push_to_hub=True)

generate.py

这是您解码方法的核心。它必须包含一个名为generate的方法，该方法的第一个参数必须是model。model是模型实例，这意味着您可以访问模型中的所有属性和方法，包括在GenerationMixin中定义的方法（例如基础generate方法）。

generate.py必须放在名为custom_generate的文件夹中，而不是仓库的根目录中。此功能的文件路径是硬编码的。

在内部，当基础generate()方法被调用时，如果带有custom_generate参数，它将首先检查其Python依赖项（如果有），然后在generate.py中定位自定义generate方法，最后调用自定义generate。所有接收到的参数和模型都会转发到您的自定义generate方法中。

这意味着您的generate可以包含原始和自定义参数的混合（以及不同的输出类型），如下所示：

python

import torch

def generate(model, input_ids, generation_config=None, left_padding=None, **kwargs):
    generation_config = generation_config or model.generation_config  # 默认使用模型的生成配置
    cur_length = input_ids.shape[1]
    max_length = generation_config.max_length or cur_length + generation_config.max_new_tokens

    # 示例自定义参数：在提示之前添加`left_padding`（整数）个填充token
    if left_padding is not None:
        if not isinstance(left_padding, int) or left_padding < 0:
            raise ValueError(f"left_padding必须是一个大于0的整数，但当前值为{left_padding}")

        pad_token = kwargs.pop("pad_token", None) or generation_config.pad_token_id or model.config.pad_token_id
        if pad_token is None:
            raise ValueError("pad_token未定义")
        batch_size = input_ids.shape[0]
        pad_tensor = torch.full(size=(batch_size, left_padding), fill_value=pad_token).to(input_ids.device)
        input_ids = torch.cat((pad_tensor, input_ids), dim=1)
        cur_length = input_ids.shape[1]

    # 简单的贪婪解码循环
    while cur_length < max_length:
        logits = model(input_ids).logits
        next_token_logits = logits[:, -1, :]
        next_tokens = torch.argmax(next_token_logits, dim=-1)
        input_ids = torch.cat((input_ids, next_tokens[:, None]), dim=-1)
        cur_length += 1

    return input_ids

为了确保您的自定义解码方法按预期工作，请遵循以下推荐实践：

尽量重用基础generate()中的验证和输入准备逻辑。
如果在模型中使用了任何私有方法/属性，请在requirements.txt中固定transformers版本。
您可以在custom_generate文件夹中添加其他文件，并使用相对导入。
考虑添加模型验证、输入验证，甚至一个单独的测试文件，以帮助用户在他们的环境中验证您的代码。

requirements.txt

您可以在custom_generate文件夹内的requirements.txt文件中可选地指定额外的Python依赖项。这些依赖项会在运行时进行检查，如果缺失，将抛出异常，提示用户更新他们的环境。

README.md

模型仓库根目录中的README.md通常描述该仓库中的模型。然而，由于仓库的重点是自定义解码方法，我们强烈建议您将重点转移到描述自定义解码方法上。除了对方法的描述外，我们还建议您记录与原始generate()相比的输入和/或输出差异。这样，用户可以专注于新内容，而依赖Transformers文档来了解通用实现细节。

为了提高可发现性，我们强烈建议您为仓库添加custom_generate标签。为此，您的README.md文件的顶部应如下所示。推送文件后，您应该会在仓库中看到该标签！

plaintext

library_name: transformers
tags:
  - custom_generate

(您的Markdown内容)

推荐实践：

在generate()中记录输入和输出差异。
添加自包含的示例，以便快速进行实验。
描述软性要求，例如该方法仅适用于某一类模型。

解码策略 ​

基础解码方法 ​

贪婪搜索（Greedy Search） ​

采样（Sampling） ​

束搜索（Beam Search） ​

高级解码方法 ​

辅助解码（Speculative Decoding） ​

提示查找解码（Prompt Lookup Decoding） ​

自我辅助解码（Self-speculative Decoding） ​

普适辅助解码（Universal Assisted Decoding） ​

对比搜索（Contrastive Search） ​

DoLa（Decoding by Contrasting Layers） ​

多样性束搜索（Diverse Beam Search） ​

自定义解码方法 ​

创建自定义解码方法 ​

添加基础模型 ​

generate.py ​

requirements.txt ​

README.md ​

解码策略

基础解码方法

贪婪搜索（Greedy Search）

采样（Sampling）

束搜索（Beam Search）

高级解码方法

辅助解码（Speculative Decoding）

提示查找解码（Prompt Lookup Decoding）

自我辅助解码（Self-speculative Decoding）

普适辅助解码（Universal Assisted Decoding）

对比搜索（Contrastive Search）

DoLa（Decoding by Contrasting Layers）

多样性束搜索（Diverse Beam Search）

自定义解码方法

创建自定义解码方法

添加基础模型

generate.py

requirements.txt

README.md