
Detailed explanation of the steps and code for building NLP models in Python

1. Environment preparation

Before we start, we need to install the NLP-related Python libraries:

pip install numpy pandas scikit-learn nltk spacy transformers torch tensorflow
  • numpy and pandas are used for data processing
  • scikit-learn is used for feature engineering and evaluation
  • nltk and spacy are used for text preprocessing
  • transformers provides pre-trained NLP models
  • torch and tensorflow are used for deep learning modeling
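
A quick sanity check that everything installed correctly (a minimal sketch; the printed versions will differ on your machine):

import numpy, pandas, sklearn, nltk, spacy, transformers, torch, tensorflow

# Print the name and version of each library
for lib in (numpy, pandas, sklearn, nltk, spacy, transformers, torch, tensorflow):
    print(lib.__name__, lib.__version__)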

2. Data preparation

We use the IMDB movie review dataset as an example; the task is sentiment analysis, a classic NLP problem of classifying each review as positive or negative.

import pandas as pd
from sklearn.model_selection import train_test_split

# Read the IMDB dataset; download it in advance (raw data: /~amaas/data/sentiment/aclImdb_v1.)
df = pd.read_csv("IMDB Dataset.csv")  # adjust the filename to your local copy

# Split the dataset
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df['review'], df['sentiment'], test_size=0.2, random_state=42)

# Convert labels to numeric values
train_labels = train_labels.map({'positive': 1, 'negative': 0})
test_labels = test_labels.map({'positive': 1, 'negative': 0})
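
Before modeling, it is worth a quick look at the data and the class balance (a small sanity check; the exact output depends on your copy of the dataset):

# Peek at the data and confirm the two classes are roughly balanced
print(df.head())
print(train_labels.value_counts())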

3. Text preprocessing

1. Clean up the text

In NLP tasks, we usually need to remove HTML tags, punctuation marks, stop words, etc.

import re
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

nltk.download('stopwords')
nltk.download('punkt')

stop_words = set(stopwords.words('english'))

# Define a text-cleaning function
def clean_text(text):
    text = re.sub(r'<.*?>', '', text)       # Remove HTML tags
    text = re.sub(r'[^a-zA-Z]', ' ', text)  # Keep letters only
    tokens = word_tokenize(text.lower())    # Tokenize
    tokens = [word for word in tokens if word not in stop_words]  # Remove stop words
    return ' '.join(tokens)

# Clean the data
train_texts = train_texts.apply(clean_text)
test_texts = test_texts.apply(clean_text)
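
To see what the cleaning step actually does, run it on a single made-up review (the sample sentence below is purely illustrative):

# HTML tags and punctuation are stripped, stop words are dropped
sample = "<br />This movie was NOT good... I regret watching it!"
print(clean_text(sample))  # roughly: "movie good regret watching"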

4. Feature Engineering

Before deep learning, we can extract text features using TF-IDF or Word2Vec.

1. TF-IDF

from sklearn.feature_extraction.text import TfidfVectorizer

vectorizer = TfidfVectorizer(max_features=5000)
X_train = vectorizer.fit_transform(train_texts)
X_test = vectorizer.transform(test_texts)
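
A quick way to confirm the vectorizer behaves as expected is to inspect the resulting matrix (the exact shape and vocabulary depend on your data):

# Each row is one review, each column one of the 5000 TF-IDF features
print(X_train.shape)
print(vectorizer.get_feature_names_out()[:10])  # a few vocabulary terms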

2. Word2Vec

Use gensim (install it separately with pip install gensim) to train Word2Vec word vectors.

from gensim.models import Word2Vec

sentences = [text.split() for text in train_texts]
word2vec_model = Word2Vec(sentences, vector_size=100, window=5, min_count=5, workers=4)
word2vec_model.save("word2vec.model")  # pick any output path
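
Word2Vec gives word-level vectors, so a classifier still needs document-level features. One common approach, sketched below (this step is not part of the original pipeline), is to mean-pool the vectors of each review's in-vocabulary words:

import numpy as np

def document_vector(text, model, dim=100):
    # Average the Word2Vec vectors of the words the model knows
    words = [w for w in text.split() if w in model.wv]
    if not words:
        return np.zeros(dim)
    return np.mean([model.wv[w] for w in words], axis=0)

X_train_w2v = np.vstack([document_vector(t, word2vec_model) for t in train_texts])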

5. Build an NLP model

1. Logistic Regression

from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

model = LogisticRegression()
model.fit(X_train, train_labels)

# Predict
preds = model.predict(X_test)
print("Logistic Regression Accuracy:", accuracy_score(test_labels, preds))

2. LSTM deep learning model

import torch
import torch.nn as nn
import torch.optim as optim

# Define the LSTM model
class LSTMModel(nn.Module):
    def __init__(self, vocab_size, embedding_dim, hidden_dim, output_dim):
        super(LSTMModel, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.lstm = nn.LSTM(embedding_dim, hidden_dim, batch_first=True)
        self.fc = nn.Linear(hidden_dim, output_dim)

    def forward(self, x):
        embedded = self.embedding(x)
        _, (hidden, _) = self.lstm(embedded)
        return self.fc(hidden.squeeze(0))

# Hyperparameters
VOCAB_SIZE = 5000
EMBEDDING_DIM = 100
HIDDEN_DIM = 128
OUTPUT_DIM = 1

model = LSTMModel(VOCAB_SIZE, EMBEDDING_DIM, HIDDEN_DIM, OUTPUT_DIM)

# Train the model; BCEWithLogitsLoss combines a sigmoid with binary cross-entropy
criterion = nn.BCEWithLogitsLoss()
optimizer = optim.Adam(model.parameters(), lr=0.001)

for epoch in range(5):
    model.train()
    optimizer.zero_grad()
    # Random token IDs of length 50 stand in for encoded reviews in this demo
    outputs = model(torch.randint(0, VOCAB_SIZE, (len(train_labels), 50)))
    loss = criterion(outputs.squeeze(), torch.tensor(train_labels.values, dtype=torch.float))
    loss.backward()
    optimizer.step()
    print(f"Epoch {epoch+1}, Loss: {loss.item()}")

6. Use pre-trained BERT model

from transformers import BertTokenizer, BertForSequenceClassification
from torch.utils.data import DataLoader, Dataset

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
model = BertForSequenceClassification.from_pretrained('bert-base-uncased')

# Tokenize the data
train_encodings = tokenizer(list(train_texts), truncation=True, padding=True, max_length=512, return_tensors="pt")
test_encodings = tokenizer(list(test_texts), truncation=True, padding=True, max_length=512, return_tensors="pt")

# Convert to a PyTorch Dataset
class IMDbDataset(Dataset):
    def __init__(self, encodings, labels):
        self.encodings = encodings
        self.labels = labels

    def __len__(self):
        return len(self.labels)

    def __getitem__(self, idx):
        item = {key: val[idx] for key, val in self.encodings.items()}
        item["labels"] = torch.tensor(self.labels[idx])
        return item

train_dataset = IMDbDataset(train_encodings, list(train_labels))
test_dataset = IMDbDataset(test_encodings, list(test_labels))

# Train BERT
train_loader = DataLoader(train_dataset, batch_size=8, shuffle=True)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(3):
    for batch in train_loader:
        optimizer.zero_grad()
        outputs = model(**batch)
        loss = outputs.loss  # the model returns a loss when "labels" are provided
        loss.backward()
        optimizer.step()
        print(f"Epoch {epoch+1}, Loss: {loss.item()}")

7. Model evaluation

from sklearn.metrics import classification_report

model.eval()
preds = []
test_loader = DataLoader(test_dataset, batch_size=8)  # no shuffling, keeps label order
with torch.no_grad():
    for batch in test_loader:
        output = model(**batch)
        preds.extend(torch.argmax(output.logits, axis=1).numpy())

print(classification_report(test_labels, preds))

8. Deploy the model

The trained NLP model can be deployed with FastAPI:

from fastapi import FastAPI
import torch

app = FastAPI()

@app.post("/predict/")
def predict(text: str):
    encoding = tokenizer(text, return_tensors="pt", truncation=True, padding=True, max_length=512)
    with torch.no_grad():
        output = model(**encoding)
        pred = torch.argmax(output.logits, axis=1).item()
    return {"sentiment": "positive" if pred == 1 else "negative"}

Run the service with:

uvicorn main:app --reload
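
Once the service is running, the endpoint can be tested with the requests library (a hypothetical client call; adjust the host and port to your setup):

import requests

resp = requests.post("http://127.0.0.1:8000/predict/", params={"text": "A wonderful, moving film."})
print(resp.json())  # e.g. {"sentiment": "positive"}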

Summary

This article has introduced the complete process of implementing an NLP model:

  • Data preprocessing
  • Feature Engineering
  • Machine Learning Models
  • Deep Learning LSTM
  • BERT pre-trained model
  • Model deployment

You can choose the appropriate NLP solution according to your business needs.

Natural language processing is an important direction in the field of artificial intelligence, mainly studying how computers understand, generate, and process human language. It can be divided into the following categories according to task type, technical method, and application field:

(1) Classification by task type

Category to sequence: for example, generating a piece of text from a given label or class.

Sequence to category: for example, sentiment analysis and text classification, which assign a text to a specific category such as positive, negative, or neutral.

Synchronized sequence to sequence: for example, sequence labeling such as part-of-speech tagging, where each input token is aligned with an output label.

Asynchronous sequence to sequence: for example, machine translation and question-answering systems, where the output sequence is not aligned one-to-one with the input.

(2) Classification by technical method

Traditional machine learning methods: rely on manual feature engineering, using models such as support vector machines (SVM) and Naive Bayes (NB).

Deep learning methods: automatically learn text features by building deep neural network models, such as recurrent neural networks (RNN), long short-term memory networks (LSTM), and Transformers.

Template-free approaches: learn from large-scale corpora without predefined templates or rules.

(3) Classification by application field

Text analysis: including sentiment analysis, text classification, named entity recognition, etc.

Speech processing: such as speech recognition and natural language generation.

Machine Translation: Convert text or voice from one language to another.

Natural language processing (NLP) occupies an extremely important position in the development of artificial intelligence (AI), serving as the bridge between the world of human language and the digital world. Its position and role in AI development are as follows:

1. Core technology of human-computer interaction

NLP empowers computers to understand and generate human language and is one of the key technologies to realize natural human-computer interaction. Through NLP, computers can understand human intentions and respond or perform tasks accordingly, thus greatly improving the efficiency and nature of human-computer interaction.

2. The driving force for the development of AI technology

NLP is one of the three pillars of artificial intelligence (the other two being machine learning and computer vision), and its development has raised the intelligence level of AI systems. With the continuous advancement of deep learning, NLP's performance on tasks such as text classification, sentiment analysis, and machine translation has improved significantly, further expanding the range of AI applications.

3. Wide application scenarios

NLP technology has penetrated into various fields, including but not limited to:

Machine Translation: Helps people communicate across language barriers.

Sentiment analysis: used to analyze emotional tendencies in texts and help companies understand customer attitudes.

Intelligent customer service: quickly and accurately understand customer problems and provide solutions.

Information retrieval: Improve search engines' semantic understanding capabilities and optimize search results.

Healthcare: automatic summarization of electronic medical records and assistance with disease diagnosis.

Financial field: Analyze market news and predict stock price trends.

4. A key link in multimodal fusion

With the development of AI technology, NLP will further integrate with other AI branches such as computer vision and speech recognition. For example, the combination of speech recognition and NLP allows intelligent voice assistants to better understand user instructions; multimodal learning enables smarter interactions by integrating visual, auditory and text information.

5. An accelerator of industry digital transformation

The application of NLP technology in various industries not only improves work efficiency, but also promotes the industry's digital transformation and intelligent upgrade. For example, in the field of education, intelligent tutoring systems provide personalized learning advice by understanding students' learning situation.

6. Potential for future development

In the future, NLP will continue to play an important role in the field of AI, including the development of cross-language models, multimodal information fusion, and the enhancement of human-computer collaboration capabilities. These innovations will further expand the application scope and service capabilities of NLP.

To sum up, NLP, as an important branch in AI development, not only promotes the development of AI at the technical level, but also has brought far-reaching impacts to human life and various industries in practical applications. It will still play an indispensable role in future development.
