SoFunction
Updated on 2025-03-05

The complete process of implementing NLP in Python

1. Install and import the necessary libraries

First, make sure that the necessary NLP libraries are installed:

pip install numpy pandas matplotlib scikit-learn nltk spacy

Then import the necessary Python libraries:

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import spacy

2. Text data preparation

In practical applications, you may need to get text data from files, databases, or web pages. Here we take a simple text dataset as an example:

# Sample text data
data = {
    'text': [
        "I love programming in Python.",
        "Python is a great language for machine learning.",
        "Natural language processing is fun!",
        "I enjoy solving problems using code.",
        "Deep learning and NLP are interesting fields.",
        "Machine learning and AI are revolutionizing industries."
    ],
    'label': [1, 1, 1, 0, 1, 0]  # 1 means positive sentiment, 0 means negative sentiment
}

df = pd.DataFrame(data)
print(df)

3. Text preprocessing

Text preprocessing is a key step in NLP and usually includes lowercasing, tokenization, stop word removal, and stemming.

3.1 Lowercasing

Convert all letters in the text to lowercase so that the same word is always written the same way.

# Convert the text to lowercase
df['text'] = df['text'].apply(lambda x: x.lower())

3.2 Tokenization

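Tokenization splits a piece of text into individual words and punctuation marks (tokens).

nltk.download('punkt')  # Download the Punkt tokenizer data (newer NLTK releases may ask for 'punkt_tab' instead)

# Tokenize each sentence
df['tokens'] = df['text'].apply(word_tokenize)
print(df['tokens'])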
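
To see what tokenization produces without downloading any NLTK data, the sketch below uses a simple regular expression as a rough stand-in for word_tokenize. The helper name simple_tokenize and its pattern are assumptions made for illustration; they are not part of NLTK, and the real tokenizer handles contractions and other edge cases differently.

```python
import re


def simple_tokenize(text):
    """Split text into runs of word characters and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text)


tokens = simple_tokenize("natural language processing is fun!")
print(tokens)  # ['natural', 'language', 'processing', 'is', 'fun', '!']
```

Punctuation marks come out as separate tokens, so they stay in the token lists unless they are filtered out explicitly.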

3.3 Remove stop words

Stop words are very common words that carry little information on their own, such as "the", "is", and "and". We remove them from the token lists.

nltk.download('stopwords')  # Download the stop word lists
stop_words = set(stopwords.words('english'))

# Remove stop words
df['tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
print(df['tokens'])
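
The filtering step can be tried in isolation with a tiny stop word set. In this sketch, demo_stop_words is a hypothetical six-word set chosen for illustration; NLTK's English list is much longer.

```python
# A tiny, hypothetical stop word set; NLTK's English list is much longer.
demo_stop_words = {"is", "a", "for", "the", "and", "in"}

tokens = ["python", "is", "a", "great", "language", "for", "machine", "learning", "."]

# Keep only the tokens that are not in the stop word set.
filtered = [word for word in tokens if word not in demo_stop_words]
print(filtered)  # ['python', 'great', 'language', 'machine', 'learning', '.']
```

The membership test is case-sensitive, which is one reason the text is lowercased before this step. The final "." survives because punctuation is not a stop word.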

3.4 Stemming

Stemming reduces a word to its base form (stem). For example, "running" becomes "run".

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stem every token
df['tokens'] = df['tokens'].apply(lambda x: [stemmer.stem(word) for word in x])
print(df['tokens'])
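
A few individual words show what the Porter stemmer does. This is a minimal sketch; the word list is chosen only for illustration, and PorterStemmer needs no extra NLTK downloads.

```python
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

words = ["running", "programming", "learning", "studies"]

# Map each word to its stem.
stems = {word: stemmer.stem(word) for word in words}
print(stems)
# {'running': 'run', 'programming': 'program', 'learning': 'learn', 'studies': 'studi'}
```

As "studi" shows, a stem is not always a dictionary word. Stemming applies suffix-stripping rules rather than looking words up.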

4. Feature Extraction

Machine learning models cannot work with raw text directly, so the text must first be converted into numerical features. A common feature extraction method is TF-IDF (Term Frequency-Inverse Document Frequency).

# Vectorize text using TF-IDF
vectorizer = TfidfVectorizer()

# Convert the text data to a TF-IDF feature matrix
# Note: this fits on the lowercased sentences in df['text'], not on the stemmed tokens from section 3
X = vectorizer.fit_transform(df['text'])

# View the resulting TF-IDF feature matrix
print(X.toarray())
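
To make the numbers in the matrix less opaque, the sketch below recomputes one row by hand and compares it with the vectorizer's output. It assumes scikit-learn's default settings as I understand them: raw term counts, the smoothed IDF ln((1 + n) / (1 + df)) + 1, and L2 normalization of each row. The two-sentence corpus and the helper manual_tfidf are hypothetical.

```python
import math

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

docs = ["python is fun", "python is powerful"]  # hypothetical toy corpus

vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(docs).toarray()

# Vocabulary terms ordered by their column index in X.
terms = sorted(vectorizer.vocabulary_, key=vectorizer.vocabulary_.get)


def manual_tfidf(doc):
    """Recompute one row: tf * (ln((1 + n) / (1 + df)) + 1), then L2-normalize."""
    words = doc.split()
    n_docs = len(docs)
    row = []
    for term in terms:
        tf = words.count(term)                      # term frequency in this document
        df = sum(term in d.split() for d in docs)   # number of documents containing the term
        idf = math.log((1 + n_docs) / (1 + df)) + 1
        row.append(tf * idf)
    row = np.array(row)
    return row / np.linalg.norm(row)


print(terms)                                     # ['fun', 'is', 'powerful', 'python']
print(np.allclose(X[0], manual_tfidf(docs[0])))  # True
```

Words that appear in every document ("python", "is") get the lowest IDF, so the distinguishing word in each sentence receives the largest weight.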

5. Splitting into training and test sets

Split the dataset into a training set and a test set, typically 80% for training and 20% for testing.

# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, df['label'], test_size=0.2, random_state=42)

print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
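
With only six samples, the 20% figure cannot be met exactly. The sketch below uses placeholder strings in place of the documents (an assumption for illustration) to show the resulting sizes and that a fixed random_state makes the split reproducible.

```python
from sklearn.model_selection import train_test_split

samples = ["s0", "s1", "s2", "s3", "s4", "s5"]  # placeholders for six documents
labels = [1, 1, 1, 0, 1, 0]

train_a, test_a, y_train_a, y_test_a = train_test_split(
    samples, labels, test_size=0.2, random_state=42
)
train_b, test_b, _, _ = train_test_split(
    samples, labels, test_size=0.2, random_state=42
)

print(len(train_a), len(test_a))  # 4 2 (the test size is rounded up to two samples)
print(test_a == test_b)           # True: same seed, same split
```

A test set of two documents may contain only one class, so any accuracy measured on it is a very rough estimate.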

6. Training the model

We train a Naive Bayes model on the data. Naive Bayes is a widely used classification algorithm that works well for text classification tasks.

# Create and train the model
model = MultinomialNB()
model.fit(X_train, y_train)
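
The following sketch shows what a fitted MultinomialNB exposes, using a hypothetical two-feature count matrix instead of the TF-IDF features above. The feature meanings ("love" and "hate" counts) and the names X_demo, y_demo, and nb are invented for illustration.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Hypothetical word-count features: columns are counts of "love" and "hate".
X_demo = np.array([[3, 0], [2, 0], [0, 2], [0, 3]])
y_demo = np.array([1, 1, 0, 0])

nb = MultinomialNB()  # default additive smoothing, alpha=1.0
nb.fit(X_demo, y_demo)

query = np.array([[2, 0]])          # a document that mentions "love" twice
proba = nb.predict_proba(query)     # one row; columns follow nb.classes_

print(nb.classes_)        # [0 1]
print(proba)              # probability of class 0, then class 1
print(nb.predict(query))  # [1]
```

predict returns the class with the highest probability, while predict_proba is useful when the confidence of a prediction matters.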

7. Evaluate the model

After training the model, we use the test set to evaluate its performance. The main evaluation metrics here are accuracy and the confusion matrix.

# Predict on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.4f}")

# Show the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(conf_matrix)

# Visualize the confusion matrix
plt.imshow(conf_matrix, cmap='Blues')
plt.title("Confusion Matrix")
plt.xlabel('Predicted')
plt.ylabel('True')
plt.colorbar()
plt.show()
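
A small worked example shows how the confusion matrix relates to accuracy. The label lists below are hypothetical and do not come from the model above.

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = [1, 0, 1, 1, 0]  # hypothetical true labels
y_hat = [1, 0, 0, 1, 1]   # hypothetical predictions

# Rows are true classes, columns are predicted classes, in the order given by labels.
cm = confusion_matrix(y_true, y_hat, labels=[0, 1])
print(cm)
# [[1 1]
#  [1 2]]

# Accuracy is the diagonal (correct predictions) divided by the total count.
acc = cm.trace() / cm.sum()
print(acc, accuracy_score(y_true, y_hat))  # both are 0.6
```

Passing labels=[0, 1] keeps the matrix 2x2 even when a small test set happens to contain only one class.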

8. Model prediction

Use the trained model to predict new text data.

# New text data
new_text = ["I love learning about AI and machine learning."]

# Text preprocessing: lowercase, tokenize, remove stop words, stem
new_text = [text.lower() for text in new_text]
new_tokens = [word_tokenize(text) for text in new_text]
new_tokens = [[stemmer.stem(word) for word in tokens if word not in stop_words] for tokens in new_tokens]
new_text_clean = [' '.join(tokens) for tokens in new_tokens]

# Feature extraction with the already fitted vectorizer
new_features = vectorizer.transform(new_text_clean)

# Predict
prediction = model.predict(new_features)
print(f"Predicted label: {prediction[0]}")
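
Because new text must go through exactly the same transformation as the training text, one option is to bundle the vectorizer and the classifier in a scikit-learn pipeline. This is a sketch under stated assumptions, not the method used above: it swaps in scikit-learn's built-in English stop word list for the NLTK one, skips stemming, and trains on all six sentences. The name text_clf is hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

texts = [
    "I love programming in Python.",
    "Python is a great language for machine learning.",
    "Natural language processing is fun!",
    "I enjoy solving problems using code.",
    "Deep learning and NLP are interesting fields.",
    "Machine learning and AI are revolutionizing industries.",
]
labels = [1, 1, 1, 0, 1, 0]

# lowercase=True is the default; stop_words="english" uses scikit-learn's built-in list.
text_clf = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    MultinomialNB(),
)
text_clf.fit(texts, labels)

# Raw strings go in; the pipeline applies the fitted vectorizer before predicting.
new_docs = ["I love learning about AI and machine learning."]
pipeline_pred = text_clf.predict(new_docs)
print(pipeline_pred)
```

With this design the preprocessing cannot drift between training and prediction, since both go through the same fitted objects.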

9. Summary

In this article, we walked through a complete NLP process, including:

Text preprocessing: lowercasing, tokenization, stop word removal, and stemming.

Feature extraction: use TF-IDF to convert text into a feature matrix.

Model training: use a Naive Bayes classifier for text classification.

Model evaluation: use accuracy and a confusion matrix to evaluate model performance.

Model prediction: predict labels for new text.

This is a typical NLP process that can be extended as needed with more features, other algorithms, and tuning steps.
