
Fine-tuning BERT with PyTorch to implement named entity recognition

Environment preparation

Before proceeding, make sure you have PyTorch, Hugging Face Transformers, and the other necessary Python libraries installed:

pip install torch transformers datasets

Loading pretrained BERT model

First, import the required modules and load the pretrained BERT model. We use the "bert-base-cased" model as a starting point:

from transformers import BertTokenizerFast, BertForTokenClassification
import torch

# Load pre-trained tokenizer (the fast tokenizer is required for word_ids() later)
tokenizer = BertTokenizerFast.from_pretrained('bert-base-cased')

# Load pre-trained model for token classification
model = BertForTokenClassification.from_pretrained('bert-base-cased', num_labels=9)

Here, num_labels is the number of label classes we want to predict. For CoNLL-2003 this is 9: the O (non-entity) tag plus B-/I- tags for each of the four entity types (PER, ORG, LOC, MISC).
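If you want predictions to be decoded into readable tag names later, you can also pass an explicit label mapping when loading the model. A minimal sketch, assuming the standard CoNLL-2003 BIO label ordering (verify it against the dataset's own features before relying on it):

# Standard CoNLL-2003 BIO label set (assumed ordering; check dataset.features["ner_tags"])
label_list = ["O", "B-PER", "I-PER", "B-ORG", "I-ORG",
              "B-LOC", "I-LOC", "B-MISC", "I-MISC"]
id2label = {i: label for i, label in enumerate(label_list)}
label2id = {label: i for i, label in enumerate(label_list)}

model = BertForTokenClassification.from_pretrained(
    'bert-base-cased',
    num_labels=len(label_list),   # 9
    id2label=id2label,
    label2id=label2id,
)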

Prepare the dataset

We will use the Hugging Face datasets library to load the data. We will demonstrate this with the "conll2003" dataset:

from datasets import load_dataset

dataset = load_dataset("conll2003")

The CoNLL-2003 dataset contains words along with part-of-speech tags, syntactic chunk tags, and named entity tags. For the NER task we only care about the entity tags (the ner_tags field). It is a classic Named Entity Recognition (NER) dataset; a short introduction follows:

1. Dataset background

  • Full name: CoNLL-2003 shared task (Conference on Computational Natural Language Learning, 2003)
  • Field: Natural Language Processing (NLP)
  • Task: Named Entity Recognition (NER)
  • Language: English
  • Organizer: the CoNLL conference (the dataset is available directly through Hugging Face's load_dataset)

2. Core content

Labeled entity types

The dataset defines the following 4 entity types:

  • PER (person names, such as "John Smith")
  • ORG (organizations, such as "Google")
  • LOC (geographical locations, such as "New York")
  • MISC (miscellaneous entities, such as nationalities, events, or product names)

Data format

  • Structure: each example is one sentence, given as a list of words (already split on whitespace), and each word has a corresponding label.

  • Example (shown here with label names for readability; see the inspection snippet below):

{
  "tokens": ["John", "works", "at", "Google", ...],
  "ner_tags": ["B-PER", "O", "O", "B-ORG", ...]
}

  • In the Hugging Face version, the relevant fields are tokens (the list of words) and ner_tags (the entity labels, stored as integer class ids).
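To confirm the format, you can print one training example and decode its integer label ids back to names through the dataset's features. A quick inspection sketch (the exact values will depend on the example):

example = dataset["train"][0]
print(example["tokens"])    # the list of words in the first sentence
print(example["ner_tags"])  # the matching integer label ids

# Decode the integer ids back to label strings (e.g. 'B-ORG', 'O', ...)
label_names = dataset["train"].features["ner_tags"].feature.names
print([label_names[i] for i in example["ner_tags"]])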

Dataset division

  • Training set: ~14,000 sentences
  • Validation set: ~3,000 sentences
  • Test set: ~3,000 sentences (these sizes can be checked directly, as shown below)
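The split sizes are easy to verify from the loaded DatasetDict:

for split in ["train", "validation", "test"]:
    print(split, len(dataset[split]))  # number of sentences in each split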

3. Application scenarios

  • Training NER models: e.g. RNN-, LSTM-, or Transformer-based models (BERT, etc.).
  • Evaluating model performance: published benchmark results (such as F1 scores) can be used to compare models.
  • Researching NLP tasks: analyzing the difficulties of entity recognition (such as ambiguity and compound entities).

Things to note

  • Labeling standard: the tag O marks non-entity tokens; the remaining tags mark specific entity types using the BIO scheme (e.g. B-PER for the first word of a person name, I-PER for the following words).
  • Data scale: compared with more modern datasets (such as OntoNotes), it contains fewer sentences and words, which makes it suitable for quickly validating models.
  • Extensibility: it can be combined with other NER datasets to improve model generalization.

Tokenization and label alignment

Before feeding the data to BERT we must tokenize it with the BERT tokenizer, and because BERT splits words into subword tokens, the entity labels have to be realigned to those subwords. Here is how we tokenize the dataset and align the labels:

def tokenize_and_align_labels(examples):
    tokenized_inputs = tokenizer(examples["tokens"],
                                 truncation=True,
                                 padding="max_length",
                                 is_split_into_words=True)
    labels = []
    for i, label in enumerate(examples["ner_tags"]):
        # Map each sub-token back to the word it came from
        word_ids = tokenized_inputs.word_ids(batch_index=i)
        previous_word_idx = None
        label_ids = []
        for word_idx in word_ids:
            if word_idx is None:
                # Special tokens ([CLS], [SEP], padding) get -100
                label_ids.append(-100)
            elif word_idx != previous_word_idx:
                # The first sub-token of a word keeps the word's label
                label_ids.append(label[word_idx])
            else:
                # Remaining sub-tokens of the same word are ignored
                label_ids.append(-100)
            previous_word_idx = word_idx
        labels.append(label_ids)
    tokenized_inputs["labels"] = labels
    return tokenized_inputs

# Apply the tokenizer to every split of the dataset
encoded_dataset = dataset.map(tokenize_and_align_labels, batched=True)

The purpose of this code is to tokenize the NER data and align the labels so that they match the input format of the pre-trained model (BERT, RoBERTa, etc.). The key points are:

  • Input: examples contains two relevant fields:
    • "tokens": the list of words for each original sentence (such as [["John", "works", "at"], ...]).
    • "ner_tags": the corresponding list of entity label ids (such as [[1, 0, 0], ...]).
  • Output:
    • tokenized_inputs: the tokenized model input (including input_ids, attention_mask, etc.).
    • labels: labels aligned with the model input (the original word-level labels mapped onto the subword positions).

The label -100 masks positions during training: PyTorch's cross-entropy loss ignores targets with the value -100, so special tokens and the extra sub-tokens of split words are skipped when the loss is computed.
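To see the alignment in action on a single made-up sentence, you can compare the tokenizer's sub-tokens with its word_ids(); this small sketch only illustrates the mechanism:

# Hypothetical sentence, not taken from the dataset
words = ["John", "works", "at", "Googleplex"]

enc = tokenizer(words, is_split_into_words=True)
print(tokenizer.convert_ids_to_tokens(enc["input_ids"]))
print(enc.word_ids())
# word_ids() returns None for special tokens such as [CLS] and [SEP], and repeats
# the word index for every sub-token of a word the tokenizer splits (here most
# likely "Googleplex"); those None and repeated positions receive the label -100.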

Fine-tuning BERT

Let's set up the data loader using PyTorch and define the training loop. This involves configuring the optimizer, setting the learning rate, and running the fine-tuning loop:

from torch.utils.data import DataLoader
from torch.optim import AdamW

train_dataset = encoded_dataset["train"]
# Keep only the tensor columns the model expects
train_dataset.set_format(type="torch",
                         columns=["input_ids", "token_type_ids", "attention_mask", "labels"])
train_dataloader = DataLoader(train_dataset, shuffle=True, batch_size=16)

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):  # loop over the dataset multiple times
    for batch in train_dataloader:
        inputs = {k: v.to(device) for k, v in batch.items() if k != "labels"}
        labels = batch["labels"].to(device)
        outputs = model(**inputs, labels=labels)
        loss = outputs.loss
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()

Training benefits greatly from a GPU. If one is available, make sure the model and every input batch are moved to the GPU, as done with device above.
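Once training has finished, the fine-tuned model can be used to tag a new sentence. Below is a minimal inference sketch (the sentence is made up; if you did not pass id2label when loading the model, the printed labels will be the generic LABEL_0 ... LABEL_8 names):

model.eval()
words = "John works at Google in New York".split()

enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
word_ids = enc.word_ids()
inputs = {k: v.to(device) for k, v in enc.items()}

with torch.no_grad():
    logits = model(**inputs).logits

pred_ids = logits.argmax(dim=-1)[0].tolist()

# Keep one prediction per original word: the first sub-token of each word
previous = None
for pos, word_idx in enumerate(word_ids):
    if word_idx is not None and word_idx != previous:
        print(words[word_idx], model.config.id2label[pred_ids[pos]])
    previous = word_idx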

Final summary

Fine-tuning BERT for named entity recognition in PyTorch involves a series of steps: loading a pre-trained BERT tokenizer and model, preparing the dataset, training, and finally using the trained model to recognize named entities. With the right dataset and proper tuning, this approach lets you apply a state-of-the-art NLP architecture to a variety of real-world scenarios.
