
TensorFlow 2.10: Implementation Details of Using BERT to Extract Answers from Text

Preamble

This article details a simple pipeline for extracting answers from text, implemented with tensorflow-gpu 2.10.

Data preparation

This section prepares the data and tools needed to train and evaluate a BERT model on SQuAD (the Stanford Question Answering Dataset).

First, the relevant libraries are imported, including os, re, json, string, numpy, tensorflow, tokenizers, and transformers. Then the maximum sequence length is set to 384 and a BertConfig object is created. Next, the tokenizer for the pre-trained bert-base-uncased model is downloaded from the Hugging Face model repository and saved to a folder named bert_base_uncased in the current directory. Once the download is complete, a BertWordPieceTokenizer is created from the vocabulary file saved in that folder; this fast word-piece tokenizer is the one used for all subsequent encoding.

The rest of the process downloads the training and validation sets from the specified URLs and saves them locally with keras.utils.get_file(), typically under the ~/.keras/datasets directory, for subsequent data preprocessing and model training.

import os
import re
import json
import string
import numpy as np
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers
from tokenizers import BertWordPieceTokenizer
from transformers import BertTokenizer, TFBertModel, BertConfig

max_len = 384
configuration = BertConfig()

# Download the slow tokenizer and save its vocabulary locally
slow_tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
save_path = "bert_base_uncased/"
if not os.path.exists(save_path):
    os.makedirs(save_path)
slow_tokenizer.save_pretrained(save_path)

# Build the fast word-piece tokenizer from the saved vocabulary file
tokenizer = BertWordPieceTokenizer("bert_base_uncased/vocab.txt", lowercase=True)
# Download the SQuAD v1.1 training and dev sets
train_data_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json"
train_path = keras.utils.get_file("train.json", train_data_url)
eval_data_url = "https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json"
eval_path = keras.utils.get_file("eval.json", eval_data_url)

Print:

Downloading data from https://rajpurkar.github.io/SQuAD-explorer/dataset/train-v1.1.json
30288272/30288272 [==============================] - 131s 4us/step
Downloading data from https://rajpurkar.github.io/SQuAD-explorer/dataset/dev-v1.1.json
4854279/4854279 [==============================] - 20s 4us/step
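
As a quick check (an addition, not part of the original article), the fast tokenizer can be tried on a short sentence; the character offsets it returns are what the preprocessing step below relies on to map an answer's character span back to tokens.

# Hypothetical sanity check of the fast tokenizer
sample = tokenizer.encode("BERT extracts answers from text.")
print(sample.tokens)   # word-piece tokens, including [CLS] and [SEP]
print(sample.ids)      # vocabulary ids fed to the model
print(sample.offsets)  # (start, end) character span of each token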

Model input and output processing

A class named SquadExample is defined here to represent one question in the SQuAD dataset together with its context passage, answer location, and related information.

The class constructor __init__() accepts five arguments: question, context, start_char_idx, answer_text, and all_answers.

The class also includes a preprocess() method that prepares each SQuAD sample: it first normalizes the whitespace of the context, question, and answer and computes the answer's end_char_idx. Next, a list is_char_in_ans is constructed that marks which characters in the context belong to the answer, based on start_char_idx and end_char_idx. Then the context is encoded with the tokenizer to obtain tokenized_context.

Next, ans_token_idx, the list of token indices covering the answer, is obtained by comparing the answer's character span with the character span of each token in the context. If the answer cannot be located in the context, the skip attribute is set to True and the method returns without producing features.

Finally, the context and question token sequences are concatenated into the input sequence input_ids; a token_type_ids sequence of the same length marks which tokens belong to which segment, and an attention_mask of the same length as input_ids is generated. All three sequences are then padded to max_len.

class SquadExample:
    def __init__(self, question, context, start_char_idx, answer_text, all_answers):
        self.question = question
        self.context = context
        self.start_char_idx = start_char_idx
        self.answer_text = answer_text
        self.all_answers = all_answers
        self.skip = False

    def preprocess(self):
        context = self.context
        question = self.question
        answer_text = self.answer_text
        start_char_idx = self.start_char_idx
        # Normalize whitespace
        context = " ".join(str(context).split())
        question = " ".join(str(question).split())
        answer = " ".join(str(answer_text).split())
        # Character index of the end of the answer
        end_char_idx = start_char_idx + len(answer)
        if end_char_idx >= len(context):
            self.skip = True
            return
        # Mark which characters of the context belong to the answer
        is_char_in_ans = [0] * len(context)
        for idx in range(start_char_idx, end_char_idx):
            is_char_in_ans[idx] = 1
        # Tokenize the context and find the tokens covering the answer
        tokenized_context = tokenizer.encode(context)
        ans_token_idx = []
        for idx, (start, end) in enumerate(tokenized_context.offsets):
            if sum(is_char_in_ans[start:end]) > 0:
                ans_token_idx.append(idx)
        if len(ans_token_idx) == 0:
            self.skip = True
            return
        start_token_idx = ans_token_idx[0]
        end_token_idx = ans_token_idx[-1]
        # Concatenate context and question, build segment ids and attention mask
        tokenized_question = tokenizer.encode(question)
        input_ids = tokenized_context.ids + tokenized_question.ids[1:]
        token_type_ids = [0] * len(tokenized_context.ids) + [1] * len(tokenized_question.ids[1:])
        attention_mask = [1] * len(input_ids)
        # Pad to max_len, or skip examples that are too long
        padding_length = max_len - len(input_ids)
        if padding_length > 0:
            input_ids = input_ids + ([0] * padding_length)
            attention_mask = attention_mask + ([0] * padding_length)
            token_type_ids = token_type_ids + ([0] * padding_length)
        elif padding_length < 0:
            self.skip = True
            return
        self.input_ids = input_ids
        self.token_type_ids = token_type_ids
        self.attention_mask = attention_mask
        self.start_token_idx = start_token_idx
        self.end_token_idx = end_token_idx
        self.context_token_to_char = tokenized_context.offsets
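
As a quick illustration (an addition, not from the original article), the class can be exercised on a single made-up question/context pair; the attributes read below are exactly those set by preprocess().

# Hypothetical standalone example of using SquadExample
eg = SquadExample(
    question="Where is the Eiffel Tower located?",
    context="The Eiffel Tower is located in Paris, France.",
    start_char_idx=31,
    answer_text="Paris",
    all_answers=["Paris"],
)
eg.preprocess()
print(eg.skip)                               # False if the answer span was found
print(eg.start_token_idx, eg.end_token_idx)  # token indices of the answer span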

The two functions here prepare the data for training a question-answering model built on BERT.

The first function, create_squad_examples, takes the raw data of a JSON file and turns each piece of data in it into the input format defined by the SquadExample class.

The second function, create_inputs_targets, converts the list of SquadExample objects into inputs and targets for the model. This function returns two lists, one for the inputs of the model, containing input_ids, token_type_ids, attention_mask, and the other for the targets of the model, containing start_token_idx, end_token_idx.

def create_squad_examples(raw_data):
    squad_examples = []
    for item in raw_data["data"]:
        for para in item["paragraphs"]:
            context = para["context"]
            for qa in para["qas"]:
                question = qa["question"]
                answer_text = qa["answers"][0]["text"]
                all_answers = [_["text"] for _ in qa["answers"]]
                start_char_idx = qa["answers"][0]["answer_start"]
                squad_eg = SquadExample(question, context, start_char_idx, answer_text, all_answers)
                squad_eg.preprocess()
                squad_examples.append(squad_eg)
    return squad_examples
def create_inputs_targets(squad_examples):
    dataset_dict = {
        "input_ids": [],
        "token_type_ids": [],
        "attention_mask": [],
        "start_token_idx": [],
        "end_token_idx": [],
    }
    for item in squad_examples:
        if item.skip == False:
            for key in dataset_dict:
                dataset_dict[key].append(getattr(item, key))
    for key in dataset_dict:
        dataset_dict[key] = np.array(dataset_dict[key])
    x = [dataset_dict["input_ids"], dataset_dict["token_type_ids"], dataset_dict["attention_mask"]]
    y = [dataset_dict["start_token_idx"], dataset_dict["end_token_idx"]]
    return x, y

Here the JSON files for the SQuAD training and validation sets are read, and the raw data is converted into lists of SquadExample objects with create_squad_examples. These lists are then turned into model inputs and target outputs with create_inputs_targets. Finally, the number of training and evaluation samples created is printed.

with open(train_path) as f:
    raw_train_data = json.load(f)
with open(eval_path) as f:
    raw_eval_data = json.load(f)
train_squad_examples = create_squad_examples(raw_train_data)
x_train, y_train = create_inputs_targets(train_squad_examples)
print(f"{len(train_squad_examples)} training points created.")
eval_squad_examples = create_squad_examples(raw_eval_data)
x_eval, y_eval = create_inputs_targets(eval_squad_examples)
print(f"{len(eval_squad_examples)} evaluation points created.")

Print:

87599 training points created.
10570 evaluation points created.
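
As a sanity check (an addition, not from the original article), the shapes of the arrays returned by create_inputs_targets can be inspected; each input should have shape (number of non-skipped examples, max_len) and each target should be a 1-D vector of token indices.

# Hypothetical shape check of the prepared arrays
for name, arr in zip(["input_ids", "token_type_ids", "attention_mask"], x_train):
    print(name, arr.shape)   # (num_non_skipped_examples, 384)
for name, arr in zip(["start_token_idx", "end_token_idx"], y_train):
    print(name, arr.shape)   # (num_non_skipped_examples,)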

Model building

Here a BERT-based Q&A model is defined. In the create_model() function, the pre-trained BERT model is first loaded with TFBertModel.from_pretrained(). Then three input layers (input_ids, token_type_ids, and attention_mask) are created, each with shape (max_len,); they receive the model's input data.

Next, the inputs are passed through the BERT encoder to obtain the token embeddings, which are then fed through two separate fully connected layers to produce start_logits and end_logits. These two vectors are flattened and passed through a softmax activation to obtain the start_probs and end_probs vectors.

Finally, a model is constructed by passing the three input layers and the two output layers to keras.Model(). The model is compiled with the SparseCategoricalCrossentropy loss function and trained with the Adam optimizer.

def create_model():
    encoder = TFBertModel.from_pretrained("bert-base-uncased")
    # Three inputs: token ids, segment ids and attention mask
    input_ids = layers.Input(shape=(max_len,), dtype=tf.int32)
    token_type_ids = layers.Input(shape=(max_len,), dtype=tf.int32)
    attention_mask = layers.Input(shape=(max_len,), dtype=tf.int32)
    embedding = encoder(input_ids, token_type_ids=token_type_ids, attention_mask=attention_mask)[0]
    # Two heads predicting the start and end positions of the answer
    start_logits = layers.Dense(1, name="start_logit", use_bias=False)(embedding)
    start_logits = layers.Flatten()(start_logits)
    end_logits = layers.Dense(1, name="end_logit", use_bias=False)(embedding)
    end_logits = layers.Flatten()(end_logits)
    start_probs = layers.Activation(keras.activations.softmax)(start_logits)
    end_probs = layers.Activation(keras.activations.softmax)(end_logits)
    model = keras.Model(
        inputs=[input_ids, token_type_ids, attention_mask],
        outputs=[start_probs, end_probs],
    )
    loss = keras.losses.SparseCategoricalCrossentropy(from_logits=False)
    optimizer = keras.optimizers.Adam(learning_rate=5e-5)
    model.compile(optimizer=optimizer, loss=[loss, loss])
    return model

The model summary below shows the architecture. All parameters are trainable, and almost all of them belong to the BERT encoder itself, so fine-tuning mostly amounts to adjusting BERT's own weights.

model = create_model()
model.summary()

Print:

Model: "model_1"
__________________________________________________________________________________________________
 Layer (type)                   Output Shape         Param #     Connected to                     
==================================================================================================
 input_4 (InputLayer)           [(None, 384)]        0           []                               
 input_6 (InputLayer)           [(None, 384)]        0           []                               
 input_5 (InputLayer)           [(None, 384)]        0           []                               
 tf_bert_model_1 (TFBertModel)  TFBaseModelOutputWi  109482240   ['input_4[0][0]',                
                                thPoolingAndCrossAt               'input_6[0][0]',                
                                tentions(last_hidde               'input_5[0][0]']                
                                n_state=(None, 384,                                               
                                 768),                                                            
                                 pooler_output=(Non                                               
                                e, 768),                                                          
                                 past_key_values=No                                               
                                ne, hidden_states=N                                               
                                one, attentions=Non                                               
                                e, cross_attentions                                               
                                =None)                                                            
 start_logit (Dense)            (None, 384, 1)       768         ['tf_bert_model_1[0][0]']        
 end_logit (Dense)              (None, 384, 1)       768         ['tf_bert_model_1[0][0]']        
 flatten_2 (Flatten)            (None, 384)          0           ['start_logit[0][0]']            
 flatten_3 (Flatten)            (None, 384)          0           ['end_logit[0][0]']              
 activation_2 (Activation)      (None, 384)          0           ['flatten_2[0][0]']              
 activation_3 (Activation)      (None, 384)          0           ['flatten_3[0][0]']              
==================================================================================================
Total params: 109,483,776
Trainable params: 109,483,776
Non-trainable params: 0
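
The parameter count in the summary checks out (an illustrative calculation, not from the original article): the two Dense heads each map the 768-dimensional hidden states to a single logit with no bias, adding 768 weights apiece on top of the BERT encoder.

bert_params = 109482240           # parameters of bert-base-uncased, as reported above
head_params = 768 * 1 * 2         # start_logit and end_logit Dense layers, no bias
print(bert_params + head_params)  # 109483776, matching "Total params" in the summary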

Custom validation callback

A callback class ExactMatch is defined here. Its initialization method __init__ receives the validation inputs and targets x_eval and y_eval. The class also implements the on_epoch_end method, which is called at the end of each epoch to compute the model's predictions and the exact match score.

Specifically, on_epoch_end runs the model on x_eval to obtain the predicted start position pred_start and end position pred_end, maps them back to text spans, and normalizes both the predicted and the reference answers to normalized_pred_ans and normalized_true_ans. If the normalized prediction appears among the normalized reference answers, the sample counts as answered correctly, and the exact match score is then printed.

def normalize_text(text):
    # Lowercase, strip punctuation and articles, and collapse whitespace
    text = text.lower()
    exclude = set(string.punctuation)
    text = "".join(ch for ch in text if ch not in exclude)
    regex = re.compile(r"\b(a|an|the)\b", re.UNICODE)
    text = re.sub(regex, " ", text)
    text = " ".join(text.split())
    return text


class ExactMatch(keras.callbacks.Callback):
    def __init__(self, x_eval, y_eval):
        self.x_eval = x_eval
        self.y_eval = y_eval

    def on_epoch_end(self, epoch, logs=None):
        pred_start, pred_end = self.model.predict(self.x_eval)
        count = 0
        eval_examples_no_skip = [_ for _ in eval_squad_examples if _.skip == False]
        for idx, (start, end) in enumerate(zip(pred_start, pred_end)):
            squad_eg = eval_examples_no_skip[idx]
            offsets = squad_eg.context_token_to_char
            start = np.argmax(start)
            end = np.argmax(end)
            if start >= len(offsets):
                continue
            pred_char_start = offsets[start][0]
            if end < len(offsets):
                pred_char_end = offsets[end][1]
                pred_ans = squad_eg.context[pred_char_start:pred_char_end]
            else:
                pred_ans = squad_eg.context[pred_char_start:]
            normalized_pred_ans = normalize_text(pred_ans)
            normalized_true_ans = [normalize_text(_) for _ in squad_eg.all_answers]
            if normalized_pred_ans in normalized_true_ans:
                count += 1
        acc = count / len(self.y_eval[0])
        print(f"\nepoch={epoch+1}, exact match score={acc:.2f}")

Model training and validation

The model is trained and its performance is measured on the validation set. epochs is set to only 1 here; increasing it should give better results.

exact_match_callback = ExactMatch(x_eval, y_eval)
model.fit(x_train, y_train, epochs=1, verbose=2, batch_size=16, callbacks=[exact_match_callback])

Print:

323/323 [==============================] - 47s 139ms/step

epoch=1, exact match score=0.77
5384/5384 - 1268s - loss: 2.4677 - activation_2_loss: 1.2876 - activation_3_loss: 1.1800 - 1268s/epoch - 236ms/step
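
With a trained model in hand, the same preprocessing can be reused to answer new questions. The helper below is a minimal inference sketch, not part of the original article; it assumes the tokenizer, max_len, and model defined above and simply mirrors the preprocessing and span-decoding logic already shown.

# Hypothetical inference helper: answer a new question with the trained model
def answer_question(question, context):
    tokenized_context = tokenizer.encode(context)
    tokenized_question = tokenizer.encode(question)
    # Build the same input representation used during training
    input_ids = tokenized_context.ids + tokenized_question.ids[1:]
    token_type_ids = [0] * len(tokenized_context.ids) + [1] * len(tokenized_question.ids[1:])
    attention_mask = [1] * len(input_ids)
    padding_length = max_len - len(input_ids)
    input_ids += [0] * padding_length
    token_type_ids += [0] * padding_length
    attention_mask += [0] * padding_length
    x = [np.array([input_ids]), np.array([token_type_ids]), np.array([attention_mask])]
    # Predict the start/end token positions and map them back to characters
    pred_start, pred_end = model.predict(x)
    start = int(np.argmax(pred_start[0]))
    end = int(np.argmax(pred_end[0]))
    offsets = tokenized_context.offsets
    if start >= len(offsets):
        return ""
    char_start = offsets[start][0]
    char_end = offsets[end][1] if end < len(offsets) else len(context)
    return context[char_start:char_end]

print(answer_question(
    "Where is the Eiffel Tower located?",
    "The Eiffel Tower is located in Paris, France.",
))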

This concludes the implementation details of extracting answers from text with BERT in TensorFlow 2.10. For more on BERT text extraction with TensorFlow, please see my other related articles!