Causal Language Modeling (CLM) Tutorial

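There are two types of language modeling: causal and masked. This guide covers causal language modeling. Causal language models are frequently used for text generation. You can use these models for creative applications such as a choose-your-own-adventure text game, or for an intelligent coding assistant like Copilot or CodeParrot.

Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to tokens on the left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.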
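
To make the "tokens on the left only" rule concrete, here is a minimal sketch of a causal attention mask in plain Python. The helper name causal_mask and the toy sentence are assumptions made for illustration; real models build an equivalent mask as a tensor inside their attention layers.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix where entry [i][j] is True
    when position i is allowed to attend to position j (j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


tokens = ["The", "cat", "sat", "down"]  # hypothetical toy sequence
mask = causal_mask(len(tokens))

for token, row in zip(tokens, mask):
    visible = [t for t, allowed in zip(tokens, row) if allowed]
    print(f"{token!r} can attend to {visible}")
```

Each row only reaches as far as its own position, so no token can look at anything to its right.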

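This guide will show you how to:

  1. Finetune DistilGPT2 on the r/askscience subset of the ELI5 dataset.
  2. Use your finetuned model for inference.

Before you begin, make sure you have all the necessary libraries installed: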

pip install transformers datasets evaluate

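We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in: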

>>> from huggingface_hub import notebook_login

>>> notebook_login()

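Load the ELI5 dataset

Start by loading a smaller subset of the r/askscience subset of the ELI5 dataset from the 🤗 Datasets library. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.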

>>> from datasets import load_dataset

>>> eli5 = load_dataset("eli5", split="train_asks[:5000]")

Split the dataset's train_asks split into a train and test set with the train_test_split method:

>>> eli5 = eli5.train_test_split(test_size=0.2)

Then take a look at an example:

>>> eli5["train"][0]
{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
  'score': [6, 3],
  'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
   "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
 'answers_urls': {'url': []},
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls': {'url': []}}

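While this may look like a lot, you're only really interested in the text field. What's cool about language modeling is that you don't need labels (which is why it is also known as an unsupervised task), because the next word, that is, the next token, is the label.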
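
The idea that the next token serves as the label can be sketched in a few lines of plain Python. The token IDs below are hypothetical values chosen for illustration. In Transformers, causal language models such as GPT-2 perform this one-position shift internally when computing the loss, so the labels you pass in are simply a copy of the input IDs.

```python
token_ids = [464, 3797, 3332, 866]  # hypothetical token IDs for a four-token sequence

# The model reads positions 0..n-2 and is scored on positions 1..n-1.
inputs = token_ids[:-1]
targets = token_ids[1:]

for position, target in enumerate(targets):
    context = token_ids[: position + 1]
    print(f"context={context} -> target={target}")
```

Every position in the text yields one training signal, which is why no separate annotation is needed.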

Preprocess

The next step is to load a DistilGPT2 tokenizer to process the text subfield:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2")

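You'll notice from the example above that the text field is actually nested inside answers. This means you'll need to extract the text subfield from its nested structure with the flatten method: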

>>> eli5 = eli5.flatten()
>>> eli5["train"][0]
{'answers.a_id': ['c3d1aib', 'c3d4lya'],
 'answers.score': [6, 3],
 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
  "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
 'answers_urls.url': [],
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls.url': []}
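
As a rough illustration of what flattening does to a single record, the sketch below turns nested dictionaries into dot-separated keys. flatten_record is a hypothetical helper written for this guide, not part of 🤗 Datasets, and the library's own implementation may work differently under the hood.

```python
def flatten_record(record, prefix=""):
    """Flatten nested dicts into a single dict with dot-separated keys."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into nested dicts, carrying the parent key as a prefix.
            flat.update(flatten_record(value, name))
        else:
            flat[name] = value
    return flat


example = {
    "q_id": "nyxfp",
    "answers": {"a_id": ["c3d1aib", "c3d4lya"], "score": [6, 3]},
    "title_urls": {"url": []},
}
flat_example = flatten_record(example)
print(flat_example)
```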

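Each subfield is now a separate column, as indicated by the answers prefix, and the text field is now a list. Instead of tokenizing each sentence separately, convert the list to a string so you can tokenize them jointly.

Here is a first preprocessing function that joins the list of strings for each example and tokenizes the result: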

>>> def preprocess_function(examples):
...     return tokenizer([" ".join(x) for x in examples["answers.text"]])

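To apply this preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up map by setting batched=True to process multiple elements of the dataset at once, and by increasing the number of processes with num_proc. Remove any columns you don't need: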

>>> tokenized_eli5 = eli5.map(
...     preprocess_function,
...     batched=True,
...     num_proc=4,
...     remove_columns=eli5["train"].column_names,
... )

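This dataset contains the token sequences, but some of them are longer than the maximum input length for the model.

Rather than truncating or padding each sequence individually, you can now use a second preprocessing function to:

  • Concatenate all the token sequences into one long sequence.
  • Split the concatenated sequence into shorter chunks of length block_size, which should be both shorter than the maximum input length and small enough to fit in your GPU memory.
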
>>> block_size = 128

>>> def group_texts(examples):
...     # Concatenate all texts.
...     concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
...     total_length = len(concatenated_examples[list(examples.keys())[0]])
...     # We drop the small remainder. If the model supports padding, you can pad instead of dropping; customize this part to your needs.
...     if total_length >= block_size:
...         total_length = (total_length // block_size) * block_size
...     # Split into chunks of block_size.
...     result = {
...         k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
...         for k, t in concatenated_examples.items()
...     }
...     result["labels"] = result["input_ids"].copy()
...     return result
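
To see the concatenate-then-chunk step on data small enough to read, here is a self-contained sketch that applies the same logic to a toy batch. The function name chunk_batch, the explicit block_size argument, and the toy token IDs are assumptions made for illustration.

```python
def chunk_batch(examples, block_size):
    """Concatenate every list in the batch, then cut it into block_size pieces.
    A trailing remainder shorter than block_size is dropped."""
    concatenated = {k: [tok for seq in seqs for tok in seq] for k, seqs in examples.items()}
    total_length = len(concatenated["input_ids"])
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = [chunk.copy() for chunk in result["input_ids"]]
    return result


batch = {
    "input_ids": [[1, 2, 3], [4, 5, 6, 7], [8, 9, 10]],
    "attention_mask": [[1, 1, 1], [1, 1, 1, 1], [1, 1, 1]],
}
chunks = chunk_batch(batch, block_size=4)
print(chunks["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The ten tokens become two blocks of four, and tokens 9 and 10 are discarded. Note that chunk boundaries ignore the original example boundaries, and that a batch with fewer than block_size tokens in total is kept as a single shorter chunk rather than dropped.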

Apply the group_texts function over the entire dataset:

>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)

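Now create a batch of examples using DataCollatorForLanguageModeling. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.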
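
Dynamic padding can be sketched in plain Python as follows. This is a simplified illustration, not the library's code: pad_batch is a hypothetical helper and pad_token_id=0 is an arbitrary value. The sketch marks only the padded positions with -100 so the loss ignores them, whereas the real collator masks every label equal to the pad token ID.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def pad_batch(sequences, pad_token_id):
    """Pad each sequence to the longest length in this batch only."""
    max_len = max(len(seq) for seq in sequences)
    batch = {"input_ids": [], "attention_mask": [], "labels": []}
    for seq in sequences:
        n_pad = max_len - len(seq)
        batch["input_ids"].append(seq + [pad_token_id] * n_pad)
        batch["attention_mask"].append([1] * len(seq) + [0] * n_pad)
        # Labels are a copy of the inputs; padded positions are ignored by the loss.
        batch["labels"].append(seq + [IGNORE_INDEX] * n_pad)
    return batch


padded = pad_batch([[11, 12, 13], [21, 22]], pad_token_id=0)
print(padded["input_ids"])
print(padded["labels"])
```

Because the target length is computed per batch, a batch of short sequences never pays for the longest sequence in the dataset.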

PyTorch

Use the end-of-sequence (EOS) token as the padding token and set mlm=False. This uses the inputs as labels, shifted to the right by one element:

>>> from transformers import DataCollatorForLanguageModeling

>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

TensorFlow

Use the end-of-sequence (EOS) token as the padding token and set mlm=False. This uses the inputs as labels, shifted to the right by one element:

>>> from transformers import DataCollatorForLanguageModeling

>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")

Train

PyTorch

If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial.

You're ready to start training your model now! Load DistilGPT2 with `AutoModelForCausalLM`:

>>> from transformers import AutoModelForCausalLM, TrainingArguments, Trainer

>>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")

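At this point, only three steps remain:

  1. Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. Set push_to_hub=True to push the model to the Hugging Face Hub (you need to be signed in to Hugging Face to upload your model).
  2. Pass the training arguments to Trainer along with the model, datasets, and data collator.
  3. Call train() to finetune your model.
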
>>> training_args = TrainingArguments(
...     output_dir="my_awesome_eli5_clm-model",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     weight_decay=0.01,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=lm_dataset["train"],
...     eval_dataset=lm_dataset["test"],
...     data_collator=data_collator,
... )

>>> trainer.train()

Once training is completed, use the evaluate() method to evaluate your model and get its perplexity:

>>> import math

>>> eval_results = trainer.evaluate()
>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
Perplexity: 49.61
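
Perplexity is the exponential of the average cross-entropy loss, which is why the snippet above passes eval_loss through math.exp. The sketch below works through the arithmetic on hypothetical per-token probabilities; the perplexity helper is an illustration, not a library function.

```python
import math


def perplexity(token_probs):
    """Perplexity = exp(mean negative log-likelihood of the correct tokens)."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))


# A model that assigns probability 0.25 to every correct token is as
# uncertain as a uniform choice among four options.
uniform = perplexity([0.25, 0.25, 0.25, 0.25])
print(round(uniform, 2))  # 4.0

# Conversely, a reported perplexity of 49.61 implies this mean loss:
implied_loss = math.log(49.61)
print(round(implied_loss, 2))  # 3.9
```

Lower perplexity means the model assigns higher probability to the held-out text.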

Then share your model to the Hub with the push_to_hub() method so everyone can use it:

>>> trainer.push_to_hub()
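
TensorFlow

If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial.

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:
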
>>> from transformers import create_optimizer, AdamWeightDecay

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Then you can load DistilGPT2 with TFAutoModelForCausalLM:

>>> from transformers import TFAutoModelForCausalLM

>>> model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

>>> tf_train_set = model.prepare_tf_dataset(
...     lm_dataset["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = model.prepare_tf_dataset(
...     lm_dataset["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )

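Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:

>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)  # No loss argument!

To push your model to the Hub during training, specify where to push your model and tokenizer in the PushToHubCallback: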

>>> from transformers.keras_callbacks import PushToHubCallback

>>> callback = PushToHubCallback(
...     output_dir="my_awesome_eli5_clm-model",
...     tokenizer=tokenizer,
... )

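Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callback to finetune the model: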

>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])

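Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

For a more in-depth example of how to finetune a model for causal language modeling, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with a prompt you'd like to generate text from: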

>>> prompt = "Somatic hypermutation allows the immune system to"

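The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for text generation with your model, and pass your text to it: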

>>> from transformers import pipeline

>>> generator = pipeline("text-generation", model="my_awesome_eli5_clm-model")
>>> generator(prompt)
[{'generated_text': "Somatic hypermutation allows the immune system to be able to effectively reverse the damage caused by an infection.\n\n\nThe damage caused by an infection is caused by the immune system's ability to perform its own self-correcting tasks."}]

PyTorch

Tokenize the text and return the input_ids as PyTorch tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
>>> inputs = tokenizer(prompt, return_tensors="pt").input_ids

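Use the generate() method to generate text. For more details about the different text generation strategies and parameters for controlling generation, check out the Text generation strategies page.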

>>> from transformers import AutoModelForCausalLM

>>> model = AutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

Decode the generated token IDs back into text:

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
["Somatic hypermutation allows the immune system to react to drugs with the ability to adapt to a different environmental situation. In other words, a system of 'hypermutation' can help the immune system to adapt to a different environmental situation or in some cases even a single life. In contrast, researchers at the University of Massachusetts-Boston have found that 'hypermutation' is much stronger in mice than in humans but can be found in humans, and that it's not completely unknown to the immune system. A study on how the immune system"]

TensorFlow

Tokenize the text and return the input_ids as TensorFlow tensors:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
>>> inputs = tokenizer(prompt, return_tensors="tf").input_ids

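Use the generate() method to generate text. For more details about the different text generation strategies and parameters for controlling generation, check out the Text generation strategies page.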

>>> from transformers import TFAutoModelForCausalLM

>>> model = TFAutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
>>> outputs = model.generate(input_ids=inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)

Decode the generated token IDs back into text:

>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Somatic hypermutation allows the immune system to detect the presence of other viruses as they become more prevalent. Therefore, researchers have identified a high proportion of human viruses. The proportion of virus-associated viruses in our study increases with age. Therefore, we propose a simple algorithm to detect the presence of these new viruses in our samples as a sign of improved immunity. A first study based on this algorithm, which will be published in Science on Friday, aims to show that this finding could translate into the development of a better vaccine that is more effective for']
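
The generate() calls above sample with top_k=50 and top_p=0.95. As a rough sketch of what those two settings do to a next-token distribution, the example below filters a hypothetical four-word distribution in plain Python. top_k_top_p_filter is an illustrative helper that works on probabilities; the library applies the equivalent filtering to logits over the full vocabulary, and details such as tie handling may differ.

```python
import random


def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose cumulative probability reaches top_p, and renormalize."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    ranked = [(tok, p / total) for tok, p in ranked]  # renormalize after top-k
    kept, cumulative = [], 0.0
    for tok, p in ranked:
        kept.append((tok, p))
        cumulative += p
        if cumulative >= top_p:
            break
    norm = sum(p for _, p in kept)
    return {tok: p / norm for tok, p in kept}


# Hypothetical next-token distribution.
probs = {"the": 0.5, "a": 0.3, "an": 0.15, "this": 0.05}
filtered = top_k_top_p_filter(probs, top_k=3, top_p=0.8)
print(filtered)

# Sample the next token from the filtered distribution (seeded for repeatability).
rng = random.Random(0)
sampled = rng.choices(list(filtered), weights=list(filtered.values()))[0]
print(sampled)
```

Here top_k=3 removes "this", and top_p=0.8 then keeps only "the" and "a", so sampling can never pick the unlikely tail. Larger values of either setting allow more varied output, and smaller values make it more conservative.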