1. Install and import the necessary libraries
First, make sure that the necessary NLP libraries are installed:
pip install numpy pandas matplotlib scikit-learn nltk spacy
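Note that spaCy's small English model is downloaded separately from the library itself:
python -m spacy download en_core_web_sm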
Then import the necessary Python libraries:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import accuracy_score, confusion_matrix
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
import spacy
2. Text data preparation
In practical applications, you may need to get text data from files, databases, or web pages. Here we take a simple text dataset as an example:
# Sample text data
data = {
    'text': [
        "I love programming in Python.",
        "Python is a great language for machine learning.",
        "Natural language processing is fun!",
        "I enjoy solving problems using code.",
        "Deep learning and NLP are interesting fields.",
        "Machine learning and AI are revolutionizing industries."
    ],
    'label': [1, 1, 1, 0, 1, 0]  # 1 means positive sentiment, 0 means negative sentiment
}
df = pd.DataFrame(data)
print(df)
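In a real project, the same DataFrame could instead be loaded from a file. A minimal sketch, assuming a CSV with 'text' and 'label' columns (the file name reviews.csv is a placeholder, not part of this example):
# Hypothetical alternative: load the data from a CSV file
df = pd.read_csv('reviews.csv')  # expects 'text' and 'label' columns
print(df.head())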
3. Text preprocessing
Text preprocessing is a key step in NLP and usually includes lowercasing, tokenization, stop-word removal, and stemming.
3.1 Lowercase
Convert all letters in the text to lowercase to ensure vocabulary consistency.
# Lowercase all text
df['text'] = df['text'].apply(lambda x: x.lower())
3.2 Tokenization
Tokenization splits a piece of text into individual words.
nltk.download('punkt')  # Download the punkt tokenizer

# Tokenize
df['tokens'] = df['text'].apply(word_tokenize)
print(df['tokens'])
3.3 Remove stop words
Stop words are common words that carry little actual information, such as "the", "is", and "and". We need to remove these words.
nltk.download('stopwords')  # Download the stop-word list
stop_words = set(stopwords.words('english'))

# Remove stop words
df['tokens'] = df['tokens'].apply(lambda x: [word for word in x if word not in stop_words])
print(df['tokens'])
3.4 Stemming
Stemming reduces a word to its base form (stem). For example, "running" becomes "run".
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

# Stem each token
df['tokens'] = df['tokens'].apply(lambda x: [stemmer.stem(word) for word in x])
print(df['tokens'])
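As an alternative to stemming, lemmatization maps each word to its dictionary form. Since spaCy was imported in step 1, here is a minimal sketch, assuming the en_core_web_sm model from step 1 has been downloaded:
# Optional: lemmatization with spaCy instead of stemming
nlp = spacy.load('en_core_web_sm')
doc = nlp("Machine learning is revolutionizing industries.")
print([token.lemma_ for token in doc])  # the dictionary form of each token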
4. Feature Extraction
Text data cannot be fed directly into machine learning models, so it must be converted into numerical features. A common feature extraction method is TF-IDF (Term Frequency-Inverse Document Frequency).
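For reference, the classic TF-IDF weight multiplies a term's frequency in a document by its inverse document frequency (scikit-learn applies a smoothed variant of the idf term and L2-normalizes each row, so exact values may differ slightly):

$$\text{tf-idf}(t, d) = \text{tf}(t, d) \times \log\frac{N}{\text{df}(t)}$$

where $N$ is the total number of documents and $\text{df}(t)$ is the number of documents containing term $t$.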
# Vectorize the text using TF-IDF
vectorizer = TfidfVectorizer()

# Convert the text data into a TF-IDF feature matrix
X = vectorizer.fit_transform(df['text'])

# View the resulting TF-IDF feature matrix
print(X.toarray())
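To see which column of the matrix corresponds to which term, the learned vocabulary can be inspected (get_feature_names_out is available in recent scikit-learn versions):
# Inspect the learned vocabulary: one entry per column of X
print(vectorizer.get_feature_names_out())
print(X.shape)  # (number of documents, vocabulary size)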
5. Splitting into training and test sets
Split the dataset into a training set and a test set, typically 80% for training and 20% for testing.
# Split into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, df['label'], test_size=0.2, random_state=42)
print(f"Training set size: {X_train.shape}")
print(f"Test set size: {X_test.shape}")
6. Training the model
We train a Naive Bayes model on the data. Naive Bayes is a commonly used classification algorithm that is well suited to text classification tasks.
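For intuition, multinomial Naive Bayes picks the class that maximizes the product of the class prior and the per-feature likelihoods (the standard textbook formulation):

$$\hat{y} = \arg\max_{y} \; P(y) \prod_{i=1}^{n} P(x_i \mid y)$$

where the "naive" assumption is that the features $x_i$ (here, term weights) are conditionally independent given the class $y$.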
# Create and train the model
model = MultinomialNB()
model.fit(X_train, y_train)
7. Evaluate the model
After training the model, we need to evaluate its performance on the test set. The main evaluation metrics are accuracy and the confusion matrix. Note that with a toy dataset this small, the scores are illustrative only.
# Predict on the test set
y_pred = model.predict(X_test)

# Calculate accuracy
accuracy = accuracy_score(y_test, y_pred)
print(f"Model accuracy: {accuracy:.4f}")

# Show the confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("Confusion matrix:")
print(conf_matrix)

# Visualize the confusion matrix
plt.imshow(conf_matrix, cmap='Blues')
plt.title("Confusion Matrix")
plt.xlabel('Predicted')
plt.ylabel('True')
plt.colorbar()
plt.show()
8. Model prediction
Use the trained model to predict new text data.
# New text data
new_text = ["I love learning about AI and machine learning."]

# Text preprocessing (same steps as before: lowercase, tokenize, remove stop words, stem)
new_text = [text.lower() for text in new_text]
new_tokens = [word_tokenize(text) for text in new_text]
new_tokens = [[stemmer.stem(word) for word in tokens if word not in stop_words] for tokens in new_tokens]
new_text_clean = [' '.join(tokens) for tokens in new_tokens]

# Feature extraction (reuse the fitted vectorizer; do not refit it)
new_features = vectorizer.transform(new_text_clean)

# Predict
prediction = model.predict(new_features)
print(f"Predicted label: {prediction[0]}")
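To avoid repeating these steps for every new input, the preprocessing and prediction can be bundled into a small helper (a sketch built on the objects defined above; the name predict_label is ours):
# Hypothetical helper that wraps the whole inference pipeline
def predict_label(texts):
    """Preprocess raw strings and return the model's predicted labels."""
    texts = [text.lower() for text in texts]
    tokens = [word_tokenize(text) for text in texts]
    tokens = [[stemmer.stem(w) for w in toks if w not in stop_words] for toks in tokens]
    cleaned = [' '.join(toks) for toks in tokens]
    return model.predict(vectorizer.transform(cleaned))

print(predict_label(["NLP makes text analysis fun."]))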
9. Summary
In this article, we walked through a complete NLP workflow, including:
Text preprocessing: lowercasing, tokenization, stop-word removal, and stemming.
Feature extraction: using TF-IDF to convert text into a feature matrix.
Model training: using a Naive Bayes classifier for text classification.
Model evaluation: using accuracy and a confusion matrix to assess model performance.
Model prediction: predicting labels for new text.
This is a typical NLP pipeline that can be extended to fit real needs by adding more features, algorithms, and tuning steps.