
Python: Chinese Sentiment Analysis with BERT

Preface

In the field of natural language processing (NLP), sentiment analysis is a very common and important application. It is often used to identify the emotion expressed in a text, such as judging whether a Weibo post or comment is positive, negative, or neutral. In recent years, with the development of deep learning, the BERT (Bidirectional Encoder Representations from Transformers) model has quickly become a powerful tool for natural language processing. BERT is a pre-trained model based on the Transformer architecture that captures the contextual information of text and achieves top performance on many NLP tasks.

This article walks beginners through using the BERT model for Chinese sentiment analysis, explaining in detail how to load an open source dataset, train the model, evaluate its accuracy, and finally export the model for future use.

What is BERT?

BERT is a pre-trained language model proposed by Google in 2018. It is pre-trained on a large amount of text data and can be applied to a wide range of natural language processing tasks, such as text classification, question answering, and translation. BERT is built on a bidirectional Transformer, which means it can understand a sentence from left to right and from right to left at the same time, capturing richer contextual information.

One of BERT's biggest strengths is that it is first pre-trained on a broad general-purpose corpus; it can then be fine-tuned for a specific task (such as sentiment analysis) to better adapt to that task.

Introduction to Chinese sentiment analysis tasks

Sentiment analysis, also known as opinion mining, analyzes the subjectivity of a piece of text in order to judge its emotional tendency. For Chinese sentiment analysis, the goal is to assign an input Chinese text to an emotional category such as "positive", "negative", or "neutral".

Steps Overview

  • Prepare the environment: Install the required libraries such as PyTorch and Transformers.
  • Load the Chinese BERT pretrained model: Use the Chinese BERT pre-trained model provided by Hugging Face.
  • Load Chinese sentiment analysis dataset: Use open source datasets such as ChnSentiCorp.
  • Data preprocessing: Tokenize and encode the text.
  • Training the model: Fine-tuning using a pre-trained BERT model.
  • Evaluate model performance: Evaluate the accuracy of the model on the test set.
  • Export the model: Save the trained model for easy use in the future.

Step 1: Prepare the environment

First, we need to install some necessary libraries. This article uses PyTorch and the Hugging Face transformers library to implement the BERT model.

Open a terminal or command line and run the following command to install these libraries:

pip install torch transformers datasets
  • torch is the core PyTorch library, used to build and train neural networks.
  • transformers is an NLP library provided by Hugging Face that contains many pre-trained language models, including BERT.
  • datasets is a dataset utility provided by Hugging Face that makes it easy to load a wide variety of datasets.
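
If you want to confirm the installation succeeded, a quick sanity check like the following (entirely optional, not part of the original steps) prints the library versions and whether a GPU is available:

import torch
import transformers
import datasets

print(torch.__version__)
print(transformers.__version__)
print(datasets.__version__)
print(torch.cuda.is_available())  # True if a CUDA GPU is available for training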

Step 2: Load the Chinese BERT pretrained model

Hugging Face provides many pre-trained Chinese BERT models. We will use bert-base-chinese, which has been pre-trained on a large Chinese corpus and is well suited to further fine-tuning.

First, import the required modules and load the model and tokenizer:

from transformers import BertTokenizer, BertForSequenceClassification
# Load the BERT Chinese pre-trained model and tokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-chinese')
model = BertForSequenceClassification.from_pretrained('bert-base-chinese', num_labels=2)

Here we use BertForSequenceClassification, which adds a classification layer on top of BERT to handle classification tasks. num_labels=2 indicates that we have two sentiment labels (positive and negative).
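
To get a feel for what the tokenizer produces, you can encode a single sentence (the example sentence here is our own):

# Encode one Chinese sentence and inspect the result
sample = tokenizer("这家酒店的环境很好", return_tensors='pt')  # "The hotel's environment is good"
print(sample['input_ids'])       # token IDs, including the special [CLS] and [SEP] tokens
print(sample['attention_mask'])  # 1 for real tokens, 0 for padding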

Step 3: Load the Chinese sentiment analysis dataset

For Chinese sentiment analysis tasks, one of the commonly used open source datasets is ChnSentiCorp. This dataset contains a large number of Chinese reviews, each labeled with its emotional category (positive or negative).

We can load this dataset directly with the Hugging Face datasets library:

from datasets import load_dataset
# Load the ChnSentiCorp dataset
dataset = load_dataset('chnsenticorp')

The loaded dataset contains three splits: train, validation, and test, used for training, validation, and testing respectively.

You can view a sample from the dataset:

print(dataset['train'][0])

The output looks something like this:

{
  'text': "The hotel's environment is good and the service is also good, worth recommending!",
  'label': 1
}

text is the review content, and label is the sentiment label: 1 indicates positive sentiment and 0 indicates negative sentiment.
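
Before training, it can be useful to check how large each split is and whether the labels are balanced (an optional check, not part of the original steps):

from collections import Counter

# Print the number of examples per split and the label distribution of the training set
print({split: len(dataset[split]) for split in dataset})
print(Counter(dataset['train']['label']))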

Step 4: Data Preprocessing

Before feeding the data into the model, we need to tokenize and encode the text. The BERT model uses its own tokenizer to split sentences into tokens and convert them into numerical IDs that the model can understand.

def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True)
# Tokenize and encode the dataset
encoded_dataset = dataset.map(tokenize_function, batched=True)

padding='max_length' pads all sentences to the same maximum length, and truncation=True truncates sentences that exceed that length.
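
By default the tokenizer pads to the model's maximum length (512 tokens for BERT), which is wasteful for short reviews. A common variation (our own tweak, not from the original steps) is to set a shorter max_length explicitly:

# Pad/truncate every example to exactly 128 tokens instead of the 512-token default
def tokenize_function(examples):
    return tokenizer(examples['text'], padding='max_length', truncation=True, max_length=128)

encoded_dataset = dataset.map(tokenize_function, batched=True)
print(len(encoded_dataset['train'][0]['input_ids']))  # 128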

Step 5: Training the model

Now that the training data is ready, we can start fine-tuning the BERT model. First, we define the training arguments and use the Trainer class to run training.

from transformers import Trainer, TrainingArguments
training_args = TrainingArguments(
    output_dir='./results',
    num_train_epochs=3,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=16,
    evaluation_strategy="epoch",
    logging_dir='./logs',
)
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['validation'],
)
# Start training
trainer.train()

Here num_train_epochs=3 means we train for 3 epochs, with a batch size of 16 per device.
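
By default, the Trainer reports only the loss during evaluation. If you also want accuracy reported each epoch, one common approach (a sketch, not part of the original article) is to pass a compute_metrics function when constructing the Trainer:

import numpy as np

# Compute accuracy from the logits and the true labels
def compute_metrics(eval_pred):
    logits, labels = eval_pred
    predictions = np.argmax(logits, axis=-1)
    return {'accuracy': (predictions == labels).mean()}

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=encoded_dataset['train'],
    eval_dataset=encoded_dataset['validation'],
    compute_metrics=compute_metrics,
)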

Step 6: Evaluate Model Performance

After training is complete, we can use the test set to evaluate the accuracy of the model:

# Evaluate the model on the test set
results = trainer.evaluate(eval_dataset=encoded_dataset['test'])
print(results)

The output will contain metrics such as the model's evaluation loss on the test set, plus accuracy if you supplied a compute_metrics function as shown above.

If, for example, the model reaches 85% accuracy on the test set, we can consider it able to identify sentiment in Chinese text reasonably well.
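
If you want to go beyond aggregate metrics and inspect individual predictions, trainer.predict returns the raw logits along with the true labels (a small optional sketch):

# Inspect individual predictions on the test set
predictions = trainer.predict(encoded_dataset['test'])
predicted_labels = predictions.predictions.argmax(axis=-1)
print(predicted_labels[:10])       # predicted classes for the first 10 test examples
print(predictions.label_ids[:10])  # the corresponding true labels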

Step 7: Export the model

After training is complete, we can save the model for later use or deployment to production:

# Save the model and tokenizer
model.save_pretrained('./sentiment_model')
tokenizer.save_pretrained('./sentiment_model')

After saving the model locally, you can load it again later and run inference with the following code:

from transformers import pipeline
# Load the trained model and tokenizer
classifier = pipeline('sentiment-analysis', model='./sentiment_model', tokenizer='./sentiment_model')
# Make a sentiment prediction
result = classifier("This product is really good!")
print(result)

The output may be similar to:

[{'label': 'POSITIVE', 'score': 0.98}]

This indicates that the model predicts this review as positive sentiment with 98% confidence.
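
One caveat worth knowing: a freshly fine-tuned BertForSequenceClassification model reports generic label names such as LABEL_0 and LABEL_1 unless the label mapping is set on its config. To get readable names like the ones above, you can set the mapping before saving (the names below are our own choice):

# Map the numeric class IDs to human-readable label names (our own naming)
model.config.id2label = {0: 'NEGATIVE', 1: 'POSITIVE'}
model.config.label2id = {'NEGATIVE': 0, 'POSITIVE': 1}
model.save_pretrained('./sentiment_model')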

Summary

This article has walked you through Chinese sentiment analysis with BERT from scratch, covering how to load a pre-trained BERT model, process an open source dataset, train the model, evaluate its performance, and finally export it. You should now have a basic understanding of BERT-based sentiment analysis and be able to implement a simple sentiment analysis system.

Summary of key points:

  • BERT is a very powerful pre-trained language model that can be adapted to a wide range of natural language processing tasks through fine-tuning.
  • Sentiment analysis is a classic NLP application, and BERT is well suited to the task.
  • Open source tools (such as the Hugging Face transformers and datasets libraries) make training and using BERT simple and fast.

Going forward, you can experiment with more sentiment categories, larger datasets, or even domain-specific training data to further improve the model's performance.
