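Question answering tasks return an answer given a question. If you've ever asked a virtual assistant like Alexa, Siri, or Google what the weather is, then you've used a question answering model. There are two common types of question answering tasks:
- Extractive: extract the answer from the given context.
- Abstractive: generate an answer from the context that correctly answers the question.
This guide will show you how to:
1. Finetune DistilBERT on the SQuAD dataset for extractive question answering.
2. Use your finetuned model for inference.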
The task illustrated in this tutorial is supported by the following model architectures:
ALBERT, BART, BERT, BigBird, BigBird-Pegasus, BLOOM, CamemBERT, CANINE, ConvBERT, Data2VecText, DeBERTa, DeBERTa-v2, DistilBERT, ELECTRA, ERNIE, ErnieM, Falcon, FlauBERT, FNet, Funnel Transformer, OpenAI GPT-2, GPT Neo, GPT NeoX, GPT-J, I-BERT, LayoutLMv2, LayoutLMv3, LED, LiLT, Longformer, LUKE, LXMERT, MarkupLM, mBART, MEGA, Megatron-BERT, MobileBERT, MPNet, MRA, MT5, MVP, Nezha, Nyströmformer, OPT, QDQBert, Reformer, RemBERT, RoBERTa, RoBERTa-PreLayerNorm, RoCBert, RoFormer, Splinter, SqueezeBERT, T5, UMT5, XLM, XLM-RoBERTa, XLM-RoBERTa-XL, XLNet, X-MOD, YOSO
Before you begin, make sure you have all the necessary libraries installed:
pip install transformers datasets evaluate
We encourage you to log in to your Hugging Face account so you can upload and share your model with the community. When prompted, enter your token to log in:
>>> from huggingface_hub import notebook_login
>>> notebook_login()
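Start by loading a smaller subset of the SQuAD dataset from the 🤗 Datasets library. This gives you a chance to experiment and make sure everything works before spending more time training on the full dataset.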
>>> from datasets import load_dataset
>>> squad = load_dataset("squad", split="train[:5000]")
Split the dataset's train split into a train and test set with the datasets.Dataset.train_test_split method:
>>> squad = squad.train_test_split(test_size=0.2)
Then take a look at an example:
>>> squad["train"][0]
{'answers': {'answer_start': [515], 'text': ['Saint Bernadette Soubirous']},
'context': 'Architecturally, the school has a Catholic character. Atop the Main Building\'s gold dome is a golden statue of the Virgin Mary. Immediately in front of the Main Building and facing it, is a copper statue of Christ with arms upraised with the legend "Venite Ad Me Omnes". Next to the Main Building is the Basilica of the Sacred Heart. Immediately behind the basilica is the Grotto, a Marian place of prayer and reflection. It is a replica of the grotto at Lourdes, France where the Virgin Mary reputedly appeared to Saint Bernadette Soubirous in 1858. At the end of the main drive (and in a direct line that connects through 3 statues and the Gold Dome), is a simple, modern stone statue of Mary.',
'id': '5733be284776f41900661182',
'question': 'To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?',
'title': 'University_of_Notre_Dame'
}
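As a quick sanity check on a record like the one above, answer_start is a character offset into context, so slicing the context at that offset should reproduce the answer text. The sketch below uses a shortened, made-up record in the same shape as a SQuAD example; the example dict and the answer_span helper are illustrative names, not part of the dataset or any library API.

```python
# Toy record shaped like a SQuAD example; the context is shortened for illustration.
context = (
    "It is a replica of the grotto at Lourdes, France where the Virgin Mary "
    "reputedly appeared to Saint Bernadette Soubirous in 1858."
)
answer_text = "Saint Bernadette Soubirous"
example = {
    "context": context,
    "question": "To whom did the Virgin Mary allegedly appear in 1858 in Lourdes France?",
    "answers": {"answer_start": [context.index(answer_text)], "text": [answer_text]},
}


def answer_span(record):
    """Return (start_char, end_char) of the first answer; end is exclusive."""
    start = record["answers"]["answer_start"][0]
    end = start + len(record["answers"]["text"][0])
    return start, end


start_char, end_char = answer_span(example)
# Slicing the context with the character span recovers the answer text.
recovered = example["context"][start_char:end_char]
print(recovered)
```

The same start and end character positions are what the preprocessing step later converts into token positions.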
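There are several important fields here:
- answers: the starting character position of the answer in the context, and the answer text.
- context: background information from which the model needs to extract the answer.
- question: the question the model should answer.
The next step is to load a DistilBERT tokenizer to process the question and context fields: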
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
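There are a few preprocessing steps particular to question answering tasks that you should be aware of:
1. Some examples in a dataset may have a very long context that exceeds the maximum input length of the model. To deal with longer sequences, truncate only the context by setting truncation="only_second".
2. Next, map the start and end positions of the answer to the original context by setting return_offsets_mapping=True.
3. With the mapping in hand, you can find the start and end tokens of the answer. Use the sequence_ids method to find which part of the offset corresponds to the question and which corresponds to the context.
Here is an example of a function that truncates the context and maps the start and end tokens of the answer to it: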
>>> def preprocess_function(examples):
...     questions = [q.strip() for q in examples["question"]]
...     inputs = tokenizer(
...         questions,
...         examples["context"],
...         max_length=384,
...         truncation="only_second",
...         return_offsets_mapping=True,
...         padding="max_length",
...     )
...
...     offset_mapping = inputs.pop("offset_mapping")
...     answers = examples["answers"]
...     start_positions = []
...     end_positions = []
...
...     for i, offset in enumerate(offset_mapping):
...         answer = answers[i]
...         start_char = answer["answer_start"][0]
...         end_char = answer["answer_start"][0] + len(answer["text"][0])
...         sequence_ids = inputs.sequence_ids(i)
...
...         # Find the first and last token of the context
...         idx = 0
...         while sequence_ids[idx] != 1:
...             idx += 1
...         context_start = idx
...         while sequence_ids[idx] == 1:
...             idx += 1
...         context_end = idx - 1
...
...         # If the answer is not fully inside the context, label it (0, 0)
...         if offset[context_start][0] > start_char or offset[context_end][1] < end_char:
...             start_positions.append(0)
...             end_positions.append(0)
...         else:
...             # Otherwise, record the start and end token positions
...             idx = context_start
...             while idx <= context_end and offset[idx][0] <= start_char:
...                 idx += 1
...             start_positions.append(idx - 1)
...
...             idx = context_end
...             while idx >= context_start and offset[idx][1] >= end_char:
...                 idx -= 1
...             end_positions.append(idx + 1)
...
...     inputs["start_positions"] = start_positions
...     inputs["end_positions"] = end_positions
...     return inputs
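To see the offset logic in isolation, here is a minimal sketch that runs without a real tokenizer. It assumes a toy whitespace "tokenizer": encode_pair and char_span_to_token_span are hypothetical helpers written for this illustration, and only mimic the shape of the offset mapping and sequence_ids described above.

```python
import re


def whitespace_offsets(text):
    """Return (start, end) character offsets of each whitespace-separated token."""
    return [(m.start(), m.end()) for m in re.finditer(r"\S+", text)]


def encode_pair(question, context, max_context_tokens=None):
    """Toy stand-in for a tokenizer call: [CLS] question [SEP] context [SEP]."""
    q = whitespace_offsets(question)
    c = whitespace_offsets(context)
    if max_context_tokens is not None:
        c = c[:max_context_tokens]  # truncate only the second sequence
    offsets = [(0, 0)] + q + [(0, 0)] + c + [(0, 0)]
    sequence_ids = [None] + [0] * len(q) + [None] + [1] * len(c) + [None]
    return offsets, sequence_ids


def char_span_to_token_span(offsets, sequence_ids, start_char, end_char):
    """Map a character span in the context to (start_token, end_token).

    Returns (0, 0) when the answer is not fully inside the context window.
    """
    context_tokens = [i for i, s in enumerate(sequence_ids) if s == 1]
    context_start, context_end = context_tokens[0], context_tokens[-1]
    if offsets[context_start][0] > start_char or offsets[context_end][1] < end_char:
        return 0, 0
    # Last context token that starts at or before the answer's first character.
    start_token = max(i for i in context_tokens if offsets[i][0] <= start_char)
    # First context token that ends at or after the answer's last character.
    end_token = min(i for i in context_tokens if offsets[i][1] >= end_char)
    return start_token, end_token


question = "To whom did Mary appear?"
context = "The Virgin Mary appeared to Saint Bernadette Soubirous in 1858."
answer = "Saint Bernadette Soubirous"
start_char = context.index(answer)
end_char = start_char + len(answer)

offsets, seq_ids = encode_pair(question, context)
full_span = char_span_to_token_span(offsets, seq_ids, start_char, end_char)

# Keep only the first 7 context tokens, which cuts the answer in half.
offsets_t, seq_ids_t = encode_pair(question, context, max_context_tokens=7)
truncated_span = char_span_to_token_span(offsets_t, seq_ids_t, start_char, end_char)

print(full_span, truncated_span)
```

With the full context the answer maps to a token span inside the context. Once truncation removes part of the answer, the example is labeled (0, 0), which is the same convention preprocess_function uses.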
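To apply the preprocessing function over the entire dataset, use the 🤗 Datasets map function. You can speed up map by setting batched=True to process multiple elements of the dataset at once. Remove any columns you don't need: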
>>> tokenized_squad = squad.map(preprocess_function, batched=True, remove_columns=squad["train"].column_names)
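With batched=True, the function receives a dictionary of columns, where each value is a list, rather than a single row. That is why preprocess_function iterates over examples["question"]. The sketch below is a simplified picture of that calling convention in plain Python; batched_map and strip_questions are hypothetical helpers, not the 🤗 Datasets implementation.

```python
def batched_map(rows, fn, batch_size=2):
    """Apply fn to column-oriented batches of rows and return the new rows.

    Hypothetical helper: a simplified picture of batched mapping.
    """
    out = []
    for i in range(0, len(rows), batch_size):
        chunk = rows[i : i + batch_size]
        # Turn a list of row dicts into a dict of column lists.
        batch = {key: [row[key] for row in chunk] for key in chunk[0]}
        result = fn(batch)
        # Turn the dict of column lists back into row dicts.
        n = len(next(iter(result.values())))
        out.extend({key: result[key][j] for key in result} for j in range(n))
    return out


def strip_questions(examples):
    # Receives lists of values, one list per column.
    return {"question": [q.strip() for q in examples["question"]]}


rows = [{"question": "  Who? "}, {"question": "When?  "}, {"question": " Where?"}]
cleaned = batched_map(rows, strip_questions)
print(cleaned)
```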
Now create a batch of examples using DefaultDataCollator. Unlike other data collators in 🤗 Transformers, DefaultDataCollator does not apply any additional preprocessing such as padding.
PyTorch:
>>> from transformers import DefaultDataCollator
>>> data_collator = DefaultDataCollator()
TensorFlow:
>>> from transformers import DefaultDataCollator
>>> data_collator = DefaultDataCollator(return_tensors="tf")
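A collator that does no padding works here because preprocess_function already padded every example to the same length with padding="max_length". The sketch below illustrates the idea in plain Python; stack_features and pad_to_max_length are hypothetical helpers and do not reproduce how DefaultDataCollator is implemented.

```python
def pad_to_max_length(ids, max_length, pad_id=0):
    """Right-pad a list of token ids to max_length (assumes len(ids) <= max_length)."""
    return ids + [pad_id] * (max_length - len(ids))


def stack_features(features):
    """Group a list of feature dicts into a dict of equal-length rows.

    Performs no padding: rows of different lengths are rejected.
    """
    batch = {}
    for key in features[0]:
        rows = [f[key] for f in features]
        lengths = {len(r) for r in rows}
        if len(lengths) != 1:
            raise ValueError(f"{key}: rows have different lengths {sorted(lengths)}")
        batch[key] = rows
    return batch


# Rows padded up front can be stacked directly.
padded = [pad_to_max_length(ids, 6) for ids in ([101, 7, 8, 102], [101, 9, 102])]
batch = stack_features([{"input_ids": ids} for ids in padded])
print(batch["input_ids"])
```

If you skip padding during tokenization, you would need a collator that pads each batch instead.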
If you aren't familiar with finetuning a model with the Trainer, take a look at the basic tutorial!
You're ready to start training your model now! Load DistilBERT with AutoModelForQuestionAnswering:
>>> from transformers import AutoModelForQuestionAnswering, TrainingArguments, Trainer
>>> model = AutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
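At this point, only three steps remain:
1. Define your training hyperparameters in TrainingArguments. The only required parameter is output_dir, which specifies where to save your model. Push this model to the Hub by setting push_to_hub=True (you need to be signed in to Hugging Face to upload your model).
2. Pass the training arguments to Trainer along with the model, dataset, tokenizer, and data collator.
3. Call train() to finetune your model.
>>> training_args = TrainingArguments(
...     output_dir="my_awesome_qa_model",
...     evaluation_strategy="epoch",
...     learning_rate=2e-5,
...     per_device_train_batch_size=16,
...     per_device_eval_batch_size=16,
...     num_train_epochs=3,
...     weight_decay=0.01,
...     push_to_hub=True,
... )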
>>> trainer = Trainer(
...     model=model,
...     args=training_args,
...     train_dataset=tokenized_squad["train"],
...     eval_dataset=tokenized_squad["test"],
...     tokenizer=tokenizer,
...     data_collator=data_collator,
... )
>>> trainer.train()
Once training is completed, share your model to the Hub with the push_to_hub() method so everyone can use your model:
>>> trainer.push_to_hub()
If you aren't familiar with finetuning a model with Keras, take a look at the basic tutorial!
To finetune a model in TensorFlow, start by setting up an optimizer function, a learning rate schedule, and some training hyperparameters:
>>> from transformers import create_optimizer
>>> batch_size = 16
>>> num_epochs = 2
>>> total_train_steps = (len(tokenized_squad["train"]) // batch_size) * num_epochs
>>> optimizer, schedule = create_optimizer(
...     init_lr=2e-5,
...     num_warmup_steps=0,
...     num_train_steps=total_train_steps,
... )
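As a worked example of the step count passed to the optimizer, the arithmetic below assumes the split used earlier in this guide: 5,000 examples with 20% held out for testing, which leaves 4,000 training examples. The variable names mirror the snippet above, and the numbers are only as valid as that assumption.

```python
num_examples = 5000  # size of the train[:5000] subset
test_examples = num_examples * 20 // 100  # test_size=0.2
train_examples = num_examples - test_examples

batch_size = 16
num_epochs = 2

# Integer division drops any final partial batch from the count.
steps_per_epoch = train_examples // batch_size
total_train_steps = steps_per_epoch * num_epochs

print(train_examples, steps_per_epoch, total_train_steps)  # 4000 250 500
```

Because the learning rate schedule is built from total_train_steps, the number of epochs you later train for should match num_epochs.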
Then you can load DistilBERT with TFAutoModelForQuestionAnswering:
>>> from transformers import TFAutoModelForQuestionAnswering
>>> model = TFAutoModelForQuestionAnswering.from_pretrained("distilbert-base-uncased")
Convert your datasets to the tf.data.Dataset format with prepare_tf_dataset():
>>> tf_train_set = model.prepare_tf_dataset(
...     tokenized_squad["train"],
...     shuffle=True,
...     batch_size=16,
...     collate_fn=data_collator,
... )
>>> tf_validation_set = model.prepare_tf_dataset(
...     tokenized_squad["test"],
...     shuffle=False,
...     batch_size=16,
...     collate_fn=data_collator,
... )
Configure the model for training with compile:
>>> import tensorflow as tf
>>> model.compile(optimizer=optimizer)
The last thing to set up before you start training is a way to push your model to the Hub. Specify where to push your model and tokenizer in the PushToHubCallback:
>>> from transformers.keras_callbacks import PushToHubCallback
>>> callback = PushToHubCallback(
...     output_dir="my_awesome_qa_model",
...     tokenizer=tokenizer,
... )
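Finally, you're ready to start finetuning your model! Call fit with your training and validation datasets, the number of epochs, and your callback. Use the same num_epochs that the learning rate schedule was built for:
>>> model.fit(x=tf_train_set, validation_data=tf_validation_set, epochs=num_epochs, callbacks=[callback])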
Once training is completed, your model is automatically uploaded to the Hub so everyone can use it!
For a more in-depth example of how to finetune a model for question answering, take a look at the corresponding PyTorch notebook or TensorFlow notebook.
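Evaluation for question answering requires a significant amount of postprocessing. To avoid taking up too much of your time, this guide skips the evaluation step. The Trainer still calculates the evaluation loss during training, so you're not completely in the dark about your model's performance.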
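If you have more time and you're interested in how to evaluate your model for question answering, take a look at the detailed evaluation example in the Question answering chapter of the course.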
Now that you've finetuned a model, you can use it for inference!
Come up with a question and some context you'd like the model to predict:
>>> question = "How many programming languages does BLOOM support?"
>>> context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."
The simplest way to try out your finetuned model for inference is to use it in a pipeline(). Instantiate a pipeline for question answering with your model, and pass your text to it:
>>> from transformers import pipeline
>>> question_answerer = pipeline("question-answering", model="my_awesome_qa_model")
>>> question_answerer(question=question, context=context)
{'score': 0.2058267742395401,
'start': 10,
'end': 95,
'answer': '176 billion parameters and can generate text in 46 languages natural languages and 13'}
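The start and end values in this output are character offsets into the context string, with end exclusive. The short check below makes that concrete by copying the context and the result dictionary from the example above; no model is needed to run it.

```python
context = "BLOOM has 176 billion parameters and can generate text in 46 languages natural languages and 13 programming languages."

# Output copied from the pipeline call above.
result = {
    "score": 0.2058267742395401,
    "start": 10,
    "end": 95,
    "answer": "176 billion parameters and can generate text in 46 languages natural languages and 13",
}

# Slicing the context with the reported offsets reproduces the answer string.
span = context[result["start"] : result["end"]]
print(span == result["answer"])
```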
You can also manually replicate the results of the pipeline if you'd like:
Tokenize the text and return PyTorch tensors:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
>>> inputs = tokenizer(question, context, return_tensors="pt")
Pass your inputs to the model and return the logits:
>>> import torch
>>> from transformers import AutoModelForQuestionAnswering
>>> model = AutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
>>> with torch.no_grad():
...     outputs = model(**inputs)
Get the highest-probability start and end positions from the model output:
>>> answer_start_index = outputs.start_logits.argmax()
>>> answer_end_index = outputs.end_logits.argmax()
Decode the predicted tokens to get the answer:
>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
>>> tokenizer.decode(predict_answer_tokens)
'176 billion parameters and can generate text in 46 languages natural languages and 13'
Tokenize the text and return TensorFlow tensors:
>>> from transformers import AutoTokenizer
>>> tokenizer = AutoTokenizer.from_pretrained("my_awesome_qa_model")
>>> inputs = tokenizer(question, context, return_tensors="tf")
Pass your inputs to the model and return the logits:
>>> import tensorflow as tf
>>> from transformers import TFAutoModelForQuestionAnswering
>>> model = TFAutoModelForQuestionAnswering.from_pretrained("my_awesome_qa_model")
>>> outputs = model(**inputs)
Get the highest-probability start and end positions from the model output:
>>> answer_start_index = int(tf.math.argmax(outputs.start_logits, axis=-1)[0])
>>> answer_end_index = int(tf.math.argmax(outputs.end_logits, axis=-1)[0])
Decode the predicted tokens to get the answer:
>>> predict_answer_tokens = inputs.input_ids[0, answer_start_index : answer_end_index + 1]
>>> tokenizer.decode(predict_answer_tokens)
'176 billion parameters and can generate text in 46 languages natural languages and 13'
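The manual steps above take the argmax of the start logits and the end logits independently, which can occasionally yield an end position that comes before the start. The framework-independent sketch below contrasts that with a common refinement: searching for the best-scoring pair with start <= end and a bounded length. The logits are made-up numbers, and best_span and max_answer_len are hypothetical names for illustration; this is not what the snippets above do.

```python
def argmax(values):
    """Index of the largest value (first one on ties)."""
    return max(range(len(values)), key=values.__getitem__)


def best_span(start_logits, end_logits, max_answer_len=15):
    """Return (start, end) maximizing start_logit + end_logit with start <= end."""
    best, best_score = (0, 0), float("-inf")
    for s, s_logit in enumerate(start_logits):
        for e in range(s, min(s + max_answer_len, len(end_logits))):
            score = s_logit + end_logits[e]
            if score > best_score:
                best, best_score = (s, e), score
    return best


# Made-up logits where the independent argmaxes disagree on ordering.
start_logits = [0.1, 0.2, 4.0, 0.3, 3.5]
end_logits = [0.0, 3.9, 0.2, 3.0, 0.1]

independent = (argmax(start_logits), argmax(end_logits))
constrained = best_span(start_logits, end_logits)

print(independent)  # (2, 1): the end comes before the start, so the slice is empty
print(constrained)  # (2, 3): best valid pair
```

An empty or reversed slice decodes to an empty string, so a constrained search like this is worth considering when answers come back blank.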