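Language modeling comes in two types: causal and masked. This guide covers causal language modeling. Causal language models are frequently used for text generation. You can use them for creative applications such as a choose-your-own text adventure game, or for an intelligent coding assistant like Copilot or CodeParrot.
Causal language modeling predicts the next token in a sequence of tokens, and the model can only attend to the tokens on its left. This means the model cannot see future tokens. GPT-2 is an example of a causal language model.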
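To make the left-only constraint concrete, here is a minimal sketch in plain Python of the lower-triangular mask that causal models rely on. The helper name causal_mask is illustrative and not part of 🤗 Transformers; real models build an equivalent mask as a tensor inside their attention layers.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix where entry [i][j] is True
    if position i may attend to position j (that is, j <= i)."""
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]


mask = causal_mask(4)
for row in mask:
    # Print 1 for visible positions and 0 for hidden (future) positions.
    print(" ".join("1" if visible else "0" for visible in row))
# 1 0 0 0
# 1 1 0 0
# 1 1 1 0
# 1 1 1 1
```

Each row is one position in the sequence: it can see itself and everything before it, but nothing after it.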
This guide will show you how to:
1. Fine-tune DistilGPT2 on the r/askscience subset of the ELI5 dataset.
2. Use your fine-tuned model for inference.
The task illustrated in this guide is supported by the following model architectures:
BART, BERT, Bert Generation, BigBird, BigBird-Pegasus, BioGpt, Blenderbot, BlenderbotSmall, BLOOM, CamemBERT, CodeGen, CPM-Ant, CTRL, Data2VecText, ELECTRA, ERNIE, Falcon, GIT, GPT-Sw3, OpenAI GPT-2, GPTBigCode, GPT Neo, GPT NeoX, GPT NeoX Japanese, GPT-J, LLaMA, Marian, mBART, MEGA, Megatron-BERT, MusicGen, MVP, OpenLlama, OpenAI GPT, OPT, Pegasus, PLBart, ProphetNet, QDQBert, Reformer, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, RWKV, Speech2Text2, Transformer-XL, TrOCR, XGLM, XLM, XLM-ProphetNet, XLM-RoBERTa, XLM-RoBERTa-XL, XLNet, X-MOD
Before you begin, make sure you have all the necessary libraries installed:
pip install transformers datasets evaluate
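We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in: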
>>> from huggingface_hub import notebook_login
>>> notebook_login()
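Start by loading the r/askscience subset of the ELI5 dataset from the 🤗 Datasets library. Experimenting on a smaller subset first lets you make sure everything works before spending more time training on the full dataset.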
>>> from datasets import load_dataset
>>> eli5 = load_dataset("eli5", split="train_asks[:5000]")
Split the dataset's train_asks split into a train and test set with the train_test_split method:
>>> eli5 = eli5.train_test_split(test_size=0.2)
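As a rough picture of what an 80/20 split does, the sketch below shuffles a list of row indices and cuts it in two. The helper split_rows and its seed parameter are hypothetical and only illustrate the idea; this is not how 🤗 Datasets implements train_test_split.

```python
import random


def split_rows(rows, test_size=0.2, seed=0):
    """Shuffle a copy of rows and split it into train and test parts."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_size)
    return {"train": shuffled[n_test:], "test": shuffled[:n_test]}


# 5000 rows, matching the size of the subset loaded above.
splits = split_rows(range(5000), test_size=0.2)
print(len(splits["train"]), len(splits["test"]))  # 4000 1000
```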
Then take a look at an example:
>>> eli5["train"][0]
{'answers': {'a_id': ['c3d1aib', 'c3d4lya'],
'score': [6, 3],
'text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
"Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"]},
'answers_urls': {'url': []},
'document': '',
'q_id': 'nyxfp',
'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
'selftext_urls': {'url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg']},
'subreddit': 'askscience',
'title': 'Few questions about this space walk photograph.',
'title_urls': {'url': []}}
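While this may look like a lot, you are really only interested in the text field. What is convenient about language modeling tasks is that you don't need labels (this is also known as an unsupervised task) because the next word, that is, the next token, is the label.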
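The idea that the text supplies its own labels can be sketched in a few lines. In the example below, whole words stand in for tokens (a real tokenizer produces subword IDs), and next_token_pairs is an illustrative helper, not a library function.

```python
def next_token_pairs(tokens):
    """Pair every prefix of tokens with the token that follows it."""
    return [(tokens[:i], tokens[i]) for i in range(1, len(tokens))]


tokens = ["The", "velocity", "needed", "to", "remain"]
pairs = next_token_pairs(tokens)
for context, target in pairs:
    # The context is what the model sees; the target is the label.
    print(context, "->", target)
```

A sequence of five tokens yields four training signals, one per position that has a successor.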
The next step is to load the DistilGPT2 tokenizer to process the text field:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilgpt2")
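As the example above shows, the text field is actually nested inside answers. This means you need to extract the text subfield from its nested structure with the flatten method: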
>>> eli5 = eli5.flatten()
>>> eli5["train"][0]
{'answers.a_id': ['c3d1aib', 'c3d4lya'],
'answers.score': [6, 3],
'answers.text': ["The velocity needed to remain in orbit is equal to the square root of Newton's constant times the mass of earth divided by the distance from the center of the earth. I don't know the altitude of that specific mission, but they're usually around 300 km. That means he's going 7-8 km/s.\n\nIn space there are no other forces acting on either the shuttle or the guy, so they stay in the same position relative to each other. If he were to become unable to return to the ship, he would presumably run out of oxygen, or slowly fall into the atmosphere and burn up.",
"Hope you don't mind me asking another question, but why aren't there any stars visible in this photo?"],
'answers_urls.url': [],
'document': '',
'q_id': 'nyxfp',
'selftext': '_URL_0_\n\nThis was on the front page earlier and I have a few questions about it. Is it possible to calculate how fast the astronaut would be orbiting the earth? Also how does he stay close to the shuttle so that he can return safely, i.e is he orbiting at the same speed and can therefore stay next to it? And finally if his propulsion system failed, would he eventually re-enter the atmosphere and presumably die?',
'selftext_urls.url': ['http://apod.nasa.gov/apod/image/1201/freeflyer_nasa_3000.jpg'],
'subreddit': 'askscience',
'title': 'Few questions about this space walk photograph.',
'title_urls.url': []}
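Each subfield is now a separate column, as indicated by the answers prefix, and the text field is now a list. Instead of tokenizing each sentence separately, convert the list to a string so you can tokenize them jointly.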
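The renaming of nested fields into prefixed columns can be pictured with a small recursive function. This is a sketch on a plain dictionary, with flatten_row as a hypothetical helper; the flatten method in 🤗 Datasets operates on the dataset's columns rather than on Python dicts.

```python
def flatten_row(row, prefix=""):
    """Flatten nested dicts into a single dict with dotted keys."""
    flat = {}
    for key, value in row.items():
        name = f"{prefix}.{key}" if prefix else key
        if isinstance(value, dict):
            # Recurse into the nested dict, carrying the parent name along.
            flat.update(flatten_row(value, name))
        else:
            flat[name] = value
    return flat


row = {
    "q_id": "nyxfp",
    "answers": {"a_id": ["c3d1aib", "c3d4lya"], "score": [6, 3]},
}
flat = flatten_row(row)
print(flat)
# {'q_id': 'nyxfp', 'answers.a_id': ['c3d1aib', 'c3d4lya'], 'answers.score': [6, 3]}
```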
Here is a preprocessing function that joins the list of strings for each example and tokenizes the result:
>>> def preprocess_function(examples):
...     return tokenizer([" ".join(x) for x in examples["answers.text"]])
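To apply this preprocessing function over the entire dataset, use the 🤗 Datasets map method. You can speed up map by setting batched=True to process multiple elements of the dataset at once, and by increasing the number of processes with num_proc. Remove any columns you don't need: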
>>> tokenized_eli5 = eli5.map(
...     preprocess_function,
...     batched=True,
...     num_proc=4,
...     remove_columns=eli5["train"].column_names,
... )
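With batched=True, the function receives a dictionary of columns, where each column holds one entry per row in the batch. The sketch below shows that shape on hand-written data. The toy_tokenizer is a stand-in that splits on whitespace, so the example runs without downloading a model; a real tokenizer returns integer IDs instead.

```python
def toy_tokenizer(texts):
    """Stand-in tokenizer: split on whitespace instead of producing real token IDs."""
    return {"input_ids": [text.split() for text in texts]}


def preprocess(examples):
    # `examples` is a dict of columns; each column holds one entry per row.
    return toy_tokenizer([" ".join(answers) for answers in examples["answers.text"]])


batch = {
    "answers.text": [
        ["First answer.", "Second answer."],
        ["Only answer."],
    ]
}
processed = preprocess(batch)
print(processed)
# {'input_ids': [['First', 'answer.', 'Second', 'answer.'], ['Only', 'answer.']]}
```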
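This dataset contains the token sequences, but some of them are longer than the model's maximum input length.
You can now use a second preprocessing function to concatenate all the sequences and split the concatenated result into shorter chunks of length block_size, which should be shorter than the maximum input length and small enough to fit in your GPU memory.
>>> block_size = 128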
>>> def group_texts(examples):
...     # Concatenate all texts.
...     concatenated_examples = {k: sum(examples[k], []) for k in examples.keys()}
...     total_length = len(concatenated_examples[list(examples.keys())[0]])
...     # We drop the small remainder. If the model supports padding, you can pad
...     # instead of dropping; customize this part to your needs.
...     if total_length >= block_size:
...         total_length = (total_length // block_size) * block_size
...     # Split into chunks of block_size.
...     result = {
...         k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
...         for k, t in concatenated_examples.items()
...     }
...     result["labels"] = result["input_ids"].copy()
...     return result
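To see what the grouping does, it helps to run the same logic on a tiny hand-made batch. The sketch below repeats the function with block_size set to 4 (an illustrative value, chosen only so the result is easy to read) and made-up token IDs.

```python
block_size = 4  # tiny value for illustration; the guide uses 128


def group_texts(examples):
    # Concatenate every column across the batch.
    concatenated = {k: sum(examples[k], []) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the remainder that does not fill a whole block.
    if total_length >= block_size:
        total_length = (total_length // block_size) * block_size
    result = {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }
    result["labels"] = result["input_ids"].copy()
    return result


batch = {"input_ids": [[1, 2, 3], [4, 5, 6, 7, 8], [9, 10]]}
grouped = group_texts(batch)
print(grouped["input_ids"])  # [[1, 2, 3, 4], [5, 6, 7, 8]]
print(grouped["labels"])     # [[1, 2, 3, 4], [5, 6, 7, 8]]
```

The three input sequences hold ten tokens in total, so two full blocks are produced and the last two tokens (9 and 10) are dropped. Note that block boundaries no longer line up with the original example boundaries.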
Apply the group_texts function over the entire dataset:
>>> lm_dataset = tokenized_eli5.map(group_texts, batched=True, num_proc=4)
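Now create a batch of examples using DataCollatorForLanguageModeling. It is more efficient to dynamically pad the sentences to the longest length in a batch during collation than to pad the whole dataset to the maximum length.
Use the end-of-sequence (EOS) token as the padding token and set mlm=False. This uses the inputs as labels, shifted to the right by one element: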
>>> from transformers import DataCollatorForLanguageModeling
>>> tokenizer.pad_token = tokenizer.eos_token
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
For TensorFlow, create the collator the same way with mlm=False and set return_tensors="tf" so that it returns TensorFlow tensors:
>>> from transformers import DataCollatorForLanguageModeling
>>> data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False, return_tensors="tf")
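Dynamic padding can be sketched without any library. The function below pads each sequence only up to the longest one in its batch and marks the padded label positions with -100, the value that PyTorch's cross-entropy loss ignores by default. The collate helper is a simplified illustration, not the collator's actual implementation, and 50256 is used here as the padding ID on the assumption that it is GPT-2's EOS token ID.

```python
IGNORE_INDEX = -100  # label value that PyTorch's cross-entropy loss skips by default


def collate(batch, pad_token_id):
    """Pad every sequence to the longest one in the batch and build labels."""
    max_len = max(len(ids) for ids in batch)
    input_ids, labels = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        input_ids.append(ids + [pad_token_id] * n_pad)
        # Padding positions should not contribute to the loss.
        labels.append(ids + [IGNORE_INDEX] * n_pad)
    return {"input_ids": input_ids, "labels": labels}


out = collate([[11, 12, 13], [21]], pad_token_id=50256)
print(out["input_ids"])  # [[11, 12, 13], [21, 50256, 50256]]
print(out["labels"])     # [[11, 12, 13], [21, -100, -100]]
```

In this guide every example already has exactly block_size tokens after grouping, so padding rarely comes into play; it matters more when sequences keep their natural lengths.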
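If you aren't familiar with fine-tuning a model with the Trainer, take a look at the basic tutorial! To get started, load DistilGPT2 with AutoModelForCausalLM: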
>>> from transformers import AutoModelForCausalLM, TrainingArguments, Trainer
>>> model = AutoModelForCausalLM.from_pretrained("distilgpt2")
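At this point, only three steps remain:
1. Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. Set push_to_hub=True to upload the model to the Hugging Face Hub (you need to be logged in to Hugging Face).
2. Pass the training arguments to Trainer along with the model, datasets, and data collator.
3. Call train() to fine-tune your model.
>>> training_args = TrainingArguments(
...     output_dir="my_awesome_eli5_clm-model",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     weight_decay=0.01,
...     push_to_hub=True,
... )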
>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=lm_dataset["train"],
...     eval_dataset=lm_dataset["test"],
...     data_collator=data_collator,
... )
>>> trainer.train()
Once training is completed, use the evaluate() method to evaluate your model and get its perplexity:
>>> import math
>>> eval_results = trainer.evaluate()
>>> print(f"Perplexity: {math.exp(eval_results['eval_loss']):.2f}")
Perplexity: 49.61
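Perplexity is the exponential of the mean negative log-likelihood per token, which is why the snippet above applies math.exp to the evaluation loss. The sketch below works through that relationship on made-up probabilities; the perplexity helper is illustrative and assumes the loss is a mean cross-entropy measured in nats.

```python
import math


def perplexity(token_probs):
    """Exponentiate the mean negative log-likelihood of the observed tokens."""
    mean_nll = -sum(math.log(p) for p in token_probs) / len(token_probs)
    return math.exp(mean_nll)


# A model that assigns probability 0.25 to every correct token is as
# uncertain as a uniform choice among 4 options.
uniform_ppl = perplexity([0.25, 0.25, 0.25, 0.25])
print(round(uniform_ppl, 6))  # 4.0

# Going the other way, a perplexity of 49.61 corresponds to this mean loss:
loss = math.log(49.61)
print(round(loss, 2))  # 3.9
```

Read this way, a perplexity of about 50 means the model is, on average, roughly as uncertain as if it were choosing uniformly among 50 tokens at each step.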
Then share your model to the Hub with the push_to_hub() method so everyone can use it:
>>> trainer.push_to_hub()
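If you aren't familiar with fine-tuning a model with Keras, take a look at the basic tutorial! To fine-tune a model in TensorFlow, start by setting up an optimizer with a learning rate and weight decay: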
>>> from transformers import create_optimizer, AdamWeightDecay
>>> optimizer = AdamWeightDecay(learning_rate=2e-5, weight_decay_rate=0.01)
Then you can load DistilGPT2 with TFAutoModelForCausalLM:
>>> from transformers import TFAutoModelForCausalLM
>>> model = TFAutoModelForCausalLM.from_pretrained("distilgpt2")
Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():
>>> tf_train_set = model.prepare_tf_dataset(
...     lm_dataset["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )
>>> tf_test_set = model.prepare_tf_dataset(
...     lm_dataset["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )
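Configure the model for training with compile. Note that Transformers models all have a default task-relevant loss function, so you don't need to specify one unless you want to:
>>> import tensorflow as tf
>>> model.compile(optimizer=optimizer)  # No loss argument needed!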
To have your model pushed to the Hub, specify where to push your model and tokenizer in the PushToHubCallback:
>>> from transformers.keras_callbacks import PushToHubCallback
>>> callback = PushToHubCallback(
...     output_dir="my_awesome_eli5_clm-model",
...     tokenizer=tokenizer,
... )
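Finally, you're ready to start training your model! Call fit with your training and validation datasets, the number of epochs, and your callback to fine-tune the model: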
>>> model.fit(x=tf_train_set, validation_data=tf_test_set, epochs=3, callbacks=[callback])
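Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!
For a more in-depth example of how to fine-tune a model for causal language modeling, take a look at the corresponding PyTorch notebook or TensorFlow notebook.
Great, now that you've fine-tuned a model, you can use it for inference!
Come up with a prompt you'd like to generate text from: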
>>> prompt = "Somatic hypermutation allows the immune system to"
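The simplest way to try out your fine-tuned model for inference is to use it in a pipeline(). Instantiate a pipeline for text generation with your model, and pass your text to it: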
>>> from transformers import pipeline
>>> generator = pipeline("text-generation", model="my_awesome_eli5_clm-model")
>>> generator(prompt)
[{'generated_text': "Somatic hypermutation allows the immune system to be able to effectively reverse the damage caused by an infection.\n\n\nThe damage caused by an infection is caused by the immune system's ability to perform its own self-correcting tasks."}]
Tokenize the text and return the input_ids as PyTorch tensors:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
>>> inputs = tokenizer(prompt, return_tensors="pt").input_ids
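Use the generate() method to generate text. For more details about the different text generation strategies and the parameters that control generation, see the Text generation strategies page.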
>>> from transformers import AutoModelForCausalLM
>>> model = AutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
>>> outputs = model.generate(inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
Decode the generated token IDs back into text:
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
["Somatic hypermutation allows the immune system to react to drugs with the ability to adapt to a different environmental situation. In other words, a system of 'hypermutation' can help the immune system to adapt to a different environmental situation or in some cases even a single life. In contrast, researchers at the University of Massachusetts-Boston have found that 'hypermutation' is much stronger in mice than in humans but can be found in humans, and that it's not completely unknown to the immune system. A study on how the immune system"]
Tokenize the text and return the input_ids as TensorFlow tensors:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_eli5_clm-model")
>>> inputs = tokenizer(prompt, return_tensors="tf").input_ids
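Use the generate() method to generate text. For more details about the different text generation strategies and the parameters that control generation, see the Text generation strategies page.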
>>> from transformers import TFAutoModelForCausalLM
>>> model = TFAutoModelForCausalLM.from_pretrained("my_awesome_eli5_clm-model")
>>> outputs = model.generate(input_ids=inputs, max_new_tokens=100, do_sample=True, top_k=50, top_p=0.95)
Decode the generated token IDs back into text:
>>> tokenizer.batch_decode(outputs, skip_special_tokens=True)
['Somatic hypermutation allows the immune system to detect the presence of other viruses as they become more prevalent. Therefore, researchers have identified a high proportion of human viruses. The proportion of virus-associated viruses in our study increases with age. Therefore, we propose a simple algorithm to detect the presence of these new viruses in our samples as a sign of improved immunity. A first study based on this algorithm, which will be published in Science on Friday, aims to show that this finding could translate into the development of a better vaccine that is more effective for']
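The generate() calls above pass do_sample=True together with top_k and top_p. As a rough sketch of what those two filters do to a next-token distribution, the function below first keeps the top_k most likely tokens and then the smallest leading group of them whose probability mass reaches top_p. The helper top_k_top_p_filter and the probabilities are made up for illustration; the library applies equivalent filtering to logits tensors, and its exact tie-breaking may differ.

```python
def top_k_top_p_filter(probs, top_k, top_p):
    """Keep the top_k most likely tokens, then the smallest prefix of them
    whose renormalized probability mass reaches top_p."""
    ranked = sorted(probs.items(), key=lambda item: item[1], reverse=True)[:top_k]
    total = sum(p for _, p in ranked)
    kept, cumulative = {}, 0.0
    for token, p in ranked:
        kept[token] = p / total
        cumulative += p / total
        if cumulative >= top_p:
            break
    # Renormalize so the surviving probabilities sum to 1.
    norm = sum(kept.values())
    return {token: p / norm for token, p in kept.items()}


probs = {"to": 0.5, "a": 0.3, "the": 0.1, "be": 0.06, "xyz": 0.04}
filtered = top_k_top_p_filter(probs, top_k=4, top_p=0.9)
print(sorted(filtered))  # ['a', 'the', 'to']
```

Here top_k=4 removes "xyz", and top_p=0.9 then removes "be" because the first three tokens already cover more than 90% of the remaining mass. Sampling draws the next token from the filtered distribution, which is why repeated runs of the same prompt produce different text.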