Masked language modeling (MLM) tutorial

Masked language modeling predicts a masked token in a sequence, and the model can attend to tokens bidirectionally. This means the model has full access to the tokens on the left and right of the mask. Masked language modeling is great for tasks that require a good contextual understanding of an entire sequence. BERT is an example of a masked language model.

This guide will show you how to:

  1. Finetune DistilRoBERTa on the r/askscience subset of the ELI5 dataset.
  2. Use your finetuned model for inference.

Before you begin, make sure you have all the necessary libraries installed:

pip install transformers datasets evaluate

We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:

>>> from huggingface_hub import notebook_login

>>> notebook_login()

Load ELI5 dataset

Start by loading a smaller subset of the r/askscience subset of the ELI5 dataset from the 🤗 Datasets library. Experimenting on a small dataset first lets you make sure everything works before spending more time training on the full dataset.

>>> from datasets import load_dataset

>>> eli5 = load_dataset("eli5", split="train_asks[:5000]")

Split the dataset's train_asks split into a train and test set with the train_test_split method:

>>> eli5 = eli5.train_test_split(test_size=0.2)
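As a quick, optional sanity check (a small sketch), you can confirm the 80/20 split of the 5,000 examples loaded above; num_rows is a standard 🤗 Datasets Dataset attribute:

>>> eli5["train"].num_rows, eli5["test"].num_rows
(4000, 1000)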

Then take a look at an example:

>>> eli5["train"][0]
{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
  'score': [6, 3],
  'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
   "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
 'answers_urls': {'url': []},
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls': {'url': []}}

While this may look like a lot, you're only really interested in the text field. What's cool about language modeling tasks is that you don't need labels (this is also known as an unsupervised task), because the masked tokens themselves serve as the labels.

Preprocess

For masked language modeling, the next step is to load a DistilRoBERTa tokenizer to process the text subfield:

>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("distilroberta-base")

You'll notice from the example above that the text field is actually nested inside answers. This means you'll need to extract the text subfield from its nested structure with the flatten method:

>>> eli5 = eli5.flatten()
>>> eli5["train"][0]
{'answers.a_id': ['c3d1aib', 'c3d4lya'],
 'answers.score': [6, 3],
 'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
  "Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
 'answers_urls.url': [],
 'document': '',
 'q_id': 'nyxfp',
 'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
 'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
 'subreddit': 'askscience',
 'title': 'Few questions about this space walk photograph.',
 'title_urls.url': []}

Each subfield is now a separate column, as indicated by the answers prefix, and the text field is now a list. Instead of tokenizing each sentence separately, join the list into one long string so you can tokenize it in one pass.

Here is a first preprocessing function that joins the list of strings for each example and tokenizes the result:

>>> def preprocess_function(examples):
...     return tokenizer([" ".join(x) for x in examples["answers.text"]])

To apply this preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up the map function by setting batched=True to process multiple elements of the dataset at once, and by increasing the number of processes with num_proc. Remove any columns you don't need:

>>> tokenized_eli5 = eli5.map(
...     preprocess_function,
...     batched=True,
...     num_proc=4,
...     remove_columns=eli5["train"].column_names,
... )

This dataset contains the token sequences, but some of them are longer than the maximum input length of the model.

You can now use a second preprocessing function to handle this:

  • concatenate all the sequences into one long sequence
  • split the concatenated sequence into chunks of length block_size, which should be shorter than the maximum input length and small enough to fit in your GPU memory
>>> block_size = 128


>>> def group_texts(examples):
...     # Concatenate all texts.
...     concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
...     total_length = len(concatenated_examples[list(examples.keys())[0]])
...     # We drop the small remainder; if the model supported padding you could pad instead of dropping, and you can customize this part to your needs.
...     if total_length >= block_size:
...         total_length = (total_length // block_size) * block_size
...     # Split into chunks of block_size.
...     result = {
...         k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
...         for k, t in concatenated_examples.items()
...     }
...     return result

Apply the group_texts function over the entire dataset:

>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
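As another optional sanity check (a minimal sketch using the lm_dataset just created), every grouped example should now contain exactly block_size tokens:

>>> len(lm_dataset["train"][0]["input_ids"])
128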

Now create a batch of examples using DataCollatorForLanguageModeling. It's more efficient to dynamically pad the sentences to the longest length in a batch during collation, instead of padding the whole dataset to the maximum length.

Pytorch

Use the end-of-sequence token as the padding token, and specify mlm_probability to randomly mask tokens each time you iterate over the data:

>>> from transformers import DataCollatorForLanguageModeling

>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)
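To see what the collator does, you can pass it a couple of tokenized examples and inspect the result: about 15% of the tokens are randomly selected for prediction (most of them replaced with the mask token), and labels is -100 everywhere else. This is only a minimal sketch reusing the tokenizer and data_collator defined above; the example sentences are made up:

>>> samples = [tokenizer("Masked language models can attend to tokens on both sides."),
...            tokenizer("The masked tokens serve as the labels.")]
>>> batch = data_collator(samples)
>>> list(batch.keys())   # includes 'input_ids', 'attention_mask' and 'labels'
>>> batch["labels"][0]   # -100 everywhere except at the randomly selected positions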
TensorFlow

Use the end-of-sequence token as the padding token, and specify mlm_probability to randomly mask tokens each time you iterate over the data:

>>> from transformers import DataCollatorForLanguageModeling

>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15, return_tensors="tf")
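The TensorFlow collator behaves the same way but returns tf.Tensor batches. A minimal sketch with made-up sentences, reusing the tokenizer and data_collator defined just above:

>>> samples = [tokenizer("Masked language models can attend to tokens on both sides."),
...            tokenizer("The masked tokens serve as the labels.")]
>>> batch = data_collator(samples)
>>> batch["labels"][0]   # a tf.Tensor, -100 everywhere except at the randomly selected positions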

Train

Pytorch

If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial here!

You're ready to start training your model now! Load DistilRoBERTa with AutoModelForMaskedLM:

>>> from transformers import AutoModelForMaskedLM, TrainingArguments, Trainer

>>> model = AutoModelForMaskedLM.from_pretrained("distilroberta-base")

At this point, only three steps remain:

  1. Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. Setting push_to_hub=True will push the model to the Hub (you need to be logged in to Hugging Face to upload your model).
  2. Pass the training arguments to the Trainer along with the model, datasets, and data collator.
  3. Call train() to finetune your model.
>>> training_args = TrainingArguments(
...     output_dir="my_awesome_eli5_mlm_model",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     num_train_epochs=3,
...     weight_decay=0.01,
...     push_to_hub=True,
... )

>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=lm_dataset["train"],
...     eval_dataset=lm_dataset["test"],
...     data_collator=data_collator,
... )

>>> trainer.train()

Once training is completed, use the evaluate() method to evaluate your model and get its perplexity. Perplexity is the exponential of the average cross-entropy loss, so lower values mean the model is less surprised by the held-out data:

>>> import math

>>> eval_results = trainer.evaluate()
>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
Perplexity: 8.76

Then share your model to the Hub with the push_to_hub() method so everyone can use it:

>>> trainer.push_to_hub()
TensorFlow

If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial here!

To finetune a model in TensorFlow, start by setting up an optimizer function, learning rate schedule, and some training hyperparameters:

>>> from transformers import create_optimizer, AdamWeightDecay

>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)

Then load DistilRoBERTa with TFAutoModelForMaskedLM:

>>> from transformers import TFAutoModelForMaskedLM

>>> model = TFAutoModelForMaskedLM.from_pretrained("distilroberta-base")

Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():

>>> tf_train_set = model.prepare_tf_dataset(
...     lm_dataset["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )

>>> tf_test_set = model.prepare_tf_dataset(
...     lm_dataset["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )

Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want a custom one:

>>> import tensorflow as tf

>>> model.compile(optimizer=optimizer)  # No loss argument needed!

Use PushToHubCallback to specify where to push your model and tokenizer on the Hub:

>>> from transformers.keras_callbacks import PushToHubCallback

>>> callback = PushToHubCallback(
...     output_dir="my_awesome_eli5_mlm_model",
...     tokenizer=tokenizer,
... )

Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callback to finetune the model:

>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])

Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!

For a more in-depth example of how to finetune a model for masked language modeling, take a look at the corresponding PyTorch notebook or TensorFlow notebook.

Inference

Great, now that you've finetuned a model, you can use it for inference!

Come up with some text you'd like the model to fill in the blank with, and use the special <mask> token to indicate the blank:

>>> text = "The Milky Way is a <mask> galaxy."

The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a fill-mask pipeline with your model, and pass your text to it. If you like, you can use the top_k parameter to specify how many predictions to return:

>>> from transformers import pipeline

>>> mask_filler = pipeline("fill-mask", "stevhliu/my_awesome_eli5_mlm_model")
>>> mask_filler(text, top_k=3)
[{'score': 0.5150994658470154,
  'token': 21300,
  'token_str': ' spiral',
  'sequence': 'The Milky Way is a spiral galaxy.'},
 {'score': 0.07087188959121704,
  'token': 2232,
  'token_str': ' massive',
  'sequence': 'The Milky Way is a massive galaxy.'},
 {'score': 0.06434620916843414,
  'token': 650,
  'token_str': ' small',
  'sequence': 'The Milky Way is a small galaxy.'}]
Pytorch

Tokenize the text and return the input_ids as PyTorch tensors. You'll also need to specify the position of the <mask> token:

>>> import torch
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_eli5_mlm_model")
>>> inputs = tokenizer(text, return_tensors="pt")
>>> mask_token_index = torch.where(inputs["input_ids"] == tokenizer.mask_token_id)[1]

Pass your inputs to the model and return the logits of the masked token:

>>> from transformers import AutoModelForMaskedLM

>>> model = AutoModelForMaskedLM.from_pretrained("stevhliu/my_awesome_eli5_mlm_model")
>>> logits = model(**inputs).logits
>>> mask_token_logits = logits[0, mask_token_index, :]

Then return the three predicted tokens with the highest probability and print them:

>>> top_3_tokens = torch.topk(mask_token_logits, 3, dim=1).indices[0].tolist()

>>> for token in top_3_tokens:
...     print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))
The Milky Way is a spiral galaxy.
The Milky Way is a massive galaxy.
The Milky Way is a small galaxy.
TensorFlow

Tokenize the text and return the input_ids as TensorFlow tensors. You'll also need to specify the position of the <mask> token:

>>> import tensorflow as tf
>>> from transformers import AutoTokenizer

>>> tokenizer = AutoTokenizer.from_pretrained("stevhliu/my_awesome_eli5_mlm_model")
>>> inputs = tokenizer(text, return_tensors="tf")
>>> mask_token_index = tf.where(inputs["input_ids"] == tokenizer.mask_token_id)[0, 1]

Pass your inputs to the model and return the logits of the masked token:

>>> from transformers import TFAutoModelForMaskedLM

>>> model = TFAutoModelForMaskedLM.from_pretrained("stevhliu/my_awesome_eli5_mlm_model")
>>> logits = model(**inputs).logits
>>> mask_token_logits = logits[0, mask_token_index, :]

Then return the three predicted tokens with the highest probability and print them:

>>> top_3_tokens = tf.math.top_k(mask_token_logits, 3).indices.numpy()

>>> for token in top_3_tokens:
...     print(text.replace(tokenizer.mask_token, tokenizer.decode([token])))
The Milky Way is a spiral galaxy.
The Milky Way is a massive galaxy.
The Milky Way is a small galaxy.