
Common methods and steps for data preprocessing with Python scikit-learn

Common methods and steps for data preprocessing

Data preprocessing is an important part of the data preparation stage. Its main purpose is to convert raw data into a format suitable for machine learning models and to deal with missing values, outliers, duplicate values, inconsistencies, and other problems in the data. Data preprocessing can significantly improve the performance and accuracy of machine learning models.

Here are some common data preprocessing steps:

  1. Missing value processing

    • Delete records containing missing values.
    • Fill in the missing values with a certain statistical value (such as mean, median, or mode).
    • Use algorithms (such as K-nearest neighbors, decision trees, etc.) to predict missing values.
  2. Outlier value detection and processing

    • Use statistical methods (such as Z-score, IQR rules) to detect outliers.
    • Decide, based on business needs, whether to delete, replace, or retain outliers.
  3. Data standardization/normalization

    • Standardization (Z-score standardization): Convert data into a distribution with a mean of 0 and a standard deviation of 1.
    • Normalization (Min-Max normalization): Scales the data to the range of [0,1] or [-1,1].
  4. Encoding categorical variables

    • One-Hot Encoding: Convert categorical variables to binary columns.
    • Label Encoding: Convert categorical variables to integers.
    • Ordinal Encoding: converts ordered categorical variables to integers, retaining order information.
  5. Feature selection and dimension reduction

    • Use statistical tests, model-based feature importances, and other methods to select important features.
    • Use PCA, t-SNE, and other methods to reduce dimensionality and feature complexity (see the PCA sketch after this list).
  6. Data transformation

    • Logarithmic transformation, Box-Cox transformation, etc., are used to stabilize variance or make the data closer to normal distribution.
    • Polynomial feature generation, used to capture nonlinear relationships.
  7. Data division

    • Divide the dataset into training, validation, and test sets to evaluate the performance and generalization ability of the model (see the train/validation/test split sketch after this list).
  8. Handle unbalanced data

    • Oversample the minority class (e.g., with the SMOTE algorithm).
    • Undersample the majority class.
    • Use synthetic sample techniques or cost-sensitive learning to deal with imbalance problems.
  9. Text data preprocessing

    • Remove stop words, punctuation marks, and special characters.
    • Stemming or lemmatization.
    • Text vectorization, such as the bag-of-words model, TF-IDF, etc.
  10. Time series data preprocessing

    • Extract date and time features, such as year, month, day, and hour.
    • Make the series stationary, e.g., by differencing or log transformation.
    • Seasonal decomposition, trend decomposition, etc.
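
As a minimal sketch of the dimensionality-reduction step in item 5 (the example data is random and the choice of 2 components is arbitrary; both are assumptions for illustration):

import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))  # 100 samples, 5 features

# Project onto the 2 principal components that explain the most variance
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                # (100, 2)
print(pca.explained_variance_ratio_)  # share of variance kept by each component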
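
And a minimal sketch of the dataset-division step in item 7, using scikit-learn's train_test_split twice to carve out a validation set (the 60/20/20 proportions are an arbitrary choice for illustration):

from sklearn.model_selection import train_test_split
import numpy as np

X = np.arange(100).reshape(50, 2)
y = np.arange(50)

# First split off the test set, then split the remainder into train/validation
X_temp, X_test, y_temp, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
X_train, X_val, y_train, y_val = train_test_split(X_temp, y_temp, test_size=0.25, random_state=42)

print(len(X_train), len(X_val), len(X_test))  # 30 10 10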

The specific steps of data preprocessing should be determined by the characteristics of the dataset, the business requirements, and the requirements of the selected model. Well-preprocessed data should better reflect the underlying structures and patterns in the data, thereby improving the predictive performance of machine learning models.

Example: Simple code for data preprocessing

Here is a simple example of data preprocessing code that uses Python's Pandas and scikit-learn libraries on a hypothetical dataset. It covers some basic preprocessing steps such as missing value handling, data standardization, and encoding categorical variables.

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder

# Suppose we have a DataFrame containing missing values, numerical features, and categorical features
data = {
    'Age': [25, np.nan, 35, 45, 55],
    'Salary': [50000, 60000, np.nan, 80000, 90000],
    'Gender': ['Male', 'Female', 'Male', 'Male', 'Female'],
    'MaritalStatus': ['Married', 'Single', 'Married', 'Single', 'Married']
}

df = pd.DataFrame(data)

# Handle missing values: fill with the column mean
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Salary'] = df['Salary'].fillna(df['Salary'].mean())

# Data standardization: standardize Age and Salary
scaler = StandardScaler()
df[['Age', 'Salary']] = scaler.fit_transform(df[['Age', 'Salary']])

# Encode categorical variables: use one-hot encoding for Gender and MaritalStatus
# (in scikit-learn versions before 1.2 the parameter is sparse=False instead of sparse_output=False)
encoder = OneHotEncoder(sparse_output=False)
encoded_gender = encoder.fit_transform(df[['Gender']])
encoded_marital = encoder.fit_transform(df[['MaritalStatus']])

# Convert the encoded arrays to DataFrames
df_gender = pd.DataFrame(encoded_gender, columns=[f"Gender_{str(i)}" for i in range(encoded_gender.shape[1])])
df_marital = pd.DataFrame(encoded_marital, columns=[f"MaritalStatus_{str(i)}" for i in range(encoded_marital.shape[1])])

# Delete the original categorical columns
df.drop(['Gender', 'MaritalStatus'], axis=1, inplace=True)

# Merge the encoded columns into the original DataFrame
df = pd.concat([df, df_gender, df_marital], axis=1)

print(df)

In this example, we first create a hypothetical dataset containing numerical features (Age, Salary) and categorical features (Gender, MaritalStatus); the dataset also contains some missing values. We then preprocess the data as follows:

  1. Filled the missing values with the column mean.
  2. Standardized the numerical features with StandardScaler.
  3. One-hot encoded the categorical features with OneHotEncoder.
  4. Dropped the original categorical columns and added the encoded columns to the DataFrame.

Please note that this example only shows the basic steps of data preprocessing; in actual applications, it may be necessary to make adjustments based on the characteristics of the data and business needs.

Detailed explanation of the main steps

1. Use statistical methods (such as Z-score, IQR rules) to detect outliers.

Outlier detection is an important step in data analysis because it helps identify unreasonable data points that may arise from data entry errors, measurement errors, or other abnormal causes. Outliers can negatively affect analysis results, so it is important to identify and handle them.

Here are two commonly used statistical methods to detect outliers:

  1. Z-score method
    The Z-score measures how many standard deviations a data point lies from the mean of the dataset. For a given data point ( x ), its Z-score can be calculated by the following formula:
    [ Z = \frac{x - \mu}{\sigma} ]
    where ( \mu ) is the mean of the data and ( \sigma ) is the standard deviation of the data.

    Typically, if the absolute value of a data point's Z-score is greater than 3 (or another threshold, such as 2 or 3.5, depending on the context), the data point can be considered an outlier.

import numpy as np
from scipy import stats

# Sample data; the last value is an outlier
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 20])

# Calculate absolute Z-scores
z_scores = np.abs(stats.zscore(data))
print("z_scores:", z_scores)
# [1.07349008 0.87831007 0.68313005 0.48795004 0.29277002 0.09759001
#  0.09759001 0.29277002 0.48795004 2.6349302 ]

# Set the threshold; 2.5 is a common choice, but adjust it to your situation
threshold = 2.5

# Detect outliers
outliers = np.where(z_scores > threshold)

print("Index of outliers:", outliers)
# (array([9], dtype=int64),)
print("Outliers:", data[outliers])
# [20]
  2. IQR rule (interquartile range rule):
    The IQR is the difference between the third quartile (Q3) and the first quartile (Q1) and is used to measure the dispersion of the data. The IQR rule defines a range bounded by Q1 - 1.5 * IQR and Q3 + 1.5 * IQR. Any data point that falls outside this range can be considered an outlier.

    Specifically, the IQR is calculated as:
    [ IQR = Q3 - Q1 ]
    The detection range for outliers is:
    [ \text{lower bound} = Q1 - 1.5 \times IQR ]
    [ \text{upper bound} = Q3 + 1.5 \times IQR ]

import numpy as np

# Sample data; the last value is an outlier
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])

# Calculate the quartiles and the IQR
Q1 = np.percentile(data, 25)
Q3 = np.percentile(data, 75)
IQR = Q3 - Q1

# Set the bounds of the IQR rule
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
print(f'lower_bound:{lower_bound}, upper_bound:{upper_bound}')
# lower_bound:-3.5, upper_bound:14.5

# Detect outliers
outliers = np.where((data < lower_bound) | (data > upper_bound))

print("Index of outliers:", outliers)
# (array([9], dtype=int64),)
print("Outliers:", data[outliers])
# [50]

When using these two methods, you need to pay attention to the following points:

  • Which method to choose depends on the distribution and characteristics of the data. The Z-score method assumes that the data is approximately normally distributed, while the IQR rule is more robust for skewed or non-normally distributed data.
  • The thresholds (such as 3 for the Z-score method or 1.5 × IQR for the IQR rule) are empirical values and may need to be adjusted to the circumstances.
  • Detected outliers require further analysis to determine whether they are real anomalies or valid parts of the data. Not all out-of-range values are wrong; some may be legitimate extreme values.
  • When dealing with outliers, carefully consider whether to delete or replace them; sometimes outliers contain important information and should not be discarded lightly. One option for replacement is sketched below.
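
As a hedged sketch of the "replace" option, values outside the IQR bounds can be clipped to those bounds (a simple form of winsorizing); this reuses the lower_bound and upper_bound computed in the IQR example above:

import numpy as np

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 50])
lower_bound, upper_bound = -3.5, 14.5  # bounds taken from the IQR example above

# Replace outliers by clipping them to the IQR bounds instead of deleting them
clipped = np.clip(data, lower_bound, upper_bound)
print(clipped)  # [ 1.   2.   3.   4.   5.   6.   7.   8.   9.  14.5]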

2. Remove stop words, punctuation marks and special characters

When processing text data, removing stop words (common words such as "the", "is", "in"), punctuation marks, and special characters is a common preprocessing step. These elements usually do not carry information that contributes substantially to the meaning of the text and may interfere with the performance of natural language processing or machine learning models.

Here is a simple Python example showing how to use the nltk library (Natural Language Toolkit) to remove stop words from text, and how to remove punctuation and special characters using regular expressions:

import re
import nltk
from nltk.corpus import stopwords

# Make sure you have downloaded the stop word list
nltk.download('stopwords')

# Get the set of stop words (NLTK ships lists for several languages, including English)
stop_words = set(stopwords.words('english'))

# Define a function to remove stop words (lowercasing each word so that
# capitalized words such as "This" match the lowercase stop word list)
def remove_stopwords(text, stop_words):
    return ' '.join([word for word in text.split() if word.lower() not in stop_words])

# Define a function to remove punctuation and special characters
def remove_punctuation_and_special_chars(text):
    # Replace non-alphanumeric, non-whitespace characters using a regular expression
    return re.sub(r'[^\w\s]', '', text)

# Sample text
text = "This is. An example, text! It contains many stop words, punctuation marks and special characters."

# Remove punctuation and special characters
cleaned_text = remove_punctuation_and_special_chars(text)

# Remove stop words
cleaned_text = remove_stopwords(cleaned_text, stop_words)

print(cleaned_text)

Please note that stopwords.words('english') provides a list of English stop words. For Chinese, you may need to create a Chinese stop word list yourself or use an existing Chinese NLP library (such as jieba) together with your own list.

Also, the remove_punctuation_and_special_chars function uses the regular expression [^\w\s] to match any character that is neither a word character nor whitespace and replaces it with an empty string. This effectively removes punctuation and special characters.

If you are dealing with Chinese text, the regular expression may need to be adjusted to the characteristics of Chinese characters. For example, to remove all symbols other than Chinese characters, you can use the regular expression [^\u4e00-\u9fa5] to match non-Chinese characters.
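
As a minimal sketch of this idea, assuming the third-party jieba library is installed and that you supply your own (here deliberately tiny) stop word list — both assumptions beyond the original example:

import re
import jieba  # third-party Chinese word segmentation library (pip install jieba)

# A tiny hand-made stop word list, for illustration only
chinese_stop_words = {"的", "是", "在"}

text = "这是一个示例文本，它包含许多停用词、标点符号和特殊字符。"

# Keep only Chinese characters
cleaned = re.sub(r'[^\u4e00-\u9fa5]', '', text)

# Segment into words, then drop the stop words
words = [w for w in jieba.lcut(cleaned) if w not in chinese_stop_words]
print(' '.join(words))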

Finally, note that removing stop words is not always necessary or beneficial; it depends on your specific task and model. In some cases, stop words may carry contextual information that is useful to the model.

3. Standardization (Z-score standardization): convert data into a distribution with a mean of 0 and a standard deviation of 1.

Standardization (also known as Z-score standardization or standard score) is a commonly used data preprocessing technique that converts data into a distribution with a mean of 0 and a standard deviation of 1. This step matters for many machine learning algorithms because it helps them handle features at different scales and reduces the outsized influence that features with large numeric ranges can have on the results.

The formula for Z-score standardization is as follows:

[ z = \frac{x - \mu}{\sigma} ]

where:

  • ( z ) is the standardized value.
  • ( x ) is the value in the original data.
  • ( \mu ) is the mean of the original data.
  • ( \sigma ) is the standard deviation of the original data.

In Python, you can use the zscore function in the scipy.stats module to perform Z-score standardization, or implement the formula manually. Here is an example of manual Z-score standardization using Pandas:

import pandas as pd

# Create a simple DataFrame as sample data
data = {'value': [10, 20, 30, 40, 50]}
df = pd.DataFrame(data)

# Calculate the mean and standard deviation (pandas' std() uses the sample standard deviation, ddof=1)
mean = df['value'].mean()
std = df['value'].std()

# Apply the Z-score standardization formula
df['standardized_value'] = (df['value'] - mean) / std

print(df)


   value  standardized_value
0     10           -1.264911
1     20           -0.632456
2     30            0.000000
3     40            0.632456
4     50            1.264911

Or use the zscore function from scipy.stats:

from scipy import stats

# Use scipy's zscore function for standardization (population standard deviation, ddof=0)
df['zscore_value'] = stats.zscore(df['value'])

print(df[['value', 'zscore_value']])

   value  zscore_value
0     10     -1.414214
1     20     -0.707107
2     30      0.000000
3     40      0.707107
4     50      1.414214

Note that the two results differ slightly: pandas' std() uses the sample standard deviation (ddof=1) by default, while scipy's zscore uses the population standard deviation (ddof=0). Either way, standardized data is better suited to machine learning models because all features share the same scale, which helps prevent some features from having an outsized impact on the model.
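
For completeness, here is a minimal sketch using scikit-learn's StandardScaler on the same column; StandardScaler also uses the population standard deviation, so it should match the scipy.stats.zscore result rather than the pandas one:

import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({'value': [10, 20, 30, 40, 50]})

scaler = StandardScaler()
df['scaled_value'] = scaler.fit_transform(df[['value']])

print(df)
#    value  scaled_value
# 0     10     -1.414214
# 1     20     -0.707107
# 2     30      0.000000
# 3     40      0.707107
# 4     50      1.414214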

4. Normalization (Min-Max normalization): Scaling the data to the range of [0,1] or [-1,1]

Normalization is another common data preprocessing technique used to scale data to a specific range, usually [0,1] or [-1,1]. Min-Max normalization is a simple normalization method that maps data values to the specified range through a linear transformation.

Min-Max normalization to the [0,1] range

The formula for normalizing data to the range [0,1] is as follows:

[ x' = \frac{x - \text{min}}{\text{max} - \text{min}} ]

where:

  • ( x' ) is the normalized value.
  • ( x ) is the value in the original data.
  • ( \text{min} ) is the minimum value in the original data.
  • ( \text{max} ) is the maximum value in the original data.

Min-Max normalization to the [-1,1] range

The formula for normalizing data to the [-1,1] range is slightly different, as shown below:

[ x' = 2 \times \frac{x - \text{min}}{\text{max} - \text{min}} - 1 ]

Or use the equivalent formula:

[ x' = \frac{x - (\text{max} + \text{min}) / 2}{(\text{max} - \text{min}) / 2} ]

Both formulas map the data to the range [-1,1].

In Python, you can use the NumPy library to easily implement Min-Max normalization. Here is a sample code:

import numpy as np

# Assume this is your dataset
data = np.array([10, 20, 30, 40, 50])

# Calculate the minimum and maximum values
data_min = np.min(data)
data_max = np.max(data)

# Min-Max normalization to the [0,1] range
normalized_data_01 = (data - data_min) / (data_max - data_min)
print("Normalized to [0,1]:", normalized_data_01)

# Min-Max normalization to the [-1,1] range (first formula)
normalized_data_11 = 2 * (data - data_min) / (data_max - data_min) - 1
print("Normalized to [-1,1] (formula 1):", normalized_data_11)

# Or use the second formula to normalize to the [-1,1] range
normalized_data_11_alt = (data - (data_max + data_min) / 2) / ((data_max - data_min) / 2)
print("Normalized to [-1,1] (formula 2):", normalized_data_11_alt)

Normalized to [0,1]: [0.   0.25 0.5  0.75 1.  ]
Normalized to [-1,1] (formula 1): [-1.  -0.5  0.   0.5  1. ]
Normalized to [-1,1] (formula 2): [-1.  -0.5  0.   0.5  1. ]

This code first calculates the minimum and maximum values in the data, then uses them to normalize the data to the [0,1] or [-1,1] range. Normalized data often works better in machine learning models because all features are scaled to a common range, which contributes to the stability and convergence speed of model training.
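
scikit-learn packages the same transformation as the MinMaxScaler class; a minimal sketch, where the feature_range parameter selects the target interval:

import numpy as np
from sklearn.preprocessing import MinMaxScaler

data = np.array([10, 20, 30, 40, 50]).reshape(-1, 1)  # scalers expect a 2D array

scaler_01 = MinMaxScaler()  # default feature_range=(0, 1)
print(scaler_01.fit_transform(data).ravel())  # [0.   0.25 0.5  0.75 1.  ]

scaler_11 = MinMaxScaler(feature_range=(-1, 1))
print(scaler_11.fit_transform(data).ravel())  # [-1.  -0.5  0.   0.5  1. ]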

5. One-Hot Encoding: Convert categorical variables into binary columns.

One-Hot Encoding is a method of converting categorical (nominal) variables into a format that machine learning algorithms can easily use. During data processing and preparation, categorical features are often encountered, such as color (red, green, blue), day of the week (Monday to Sunday), or gender (male, female). These categorical features usually cannot be used directly in machine learning models because they are not numerical, and most machine learning algorithms can only process numerical data.

The basic idea of one-hot encoding is to create a new binary column for each category value. If a record's value equals the category represented by a column, that column is 1; otherwise it is 0. In this way, each category value is represented as a unique binary vector that machine learning models can handle.

For example, suppose there is a feature called "color", which has three possible values: red, green, and blue. Through one-hot encoding, we can convert this feature into three binary columns:

  • Color_Red: If the color in the original data is red, the column is 1, otherwise it is 0.
  • Color_Green: If the color in the original data is green, the column is 1, otherwise it is 0.
  • Color_Blue: If the color in the original data is blue, the column is 1, otherwise it is 0.

In Python, you can use the pandas library or scikit-learn's OneHotEncoder class to perform one-hot encoding. Here is an example of one-hot encoding with pandas:

import pandas as pd

# Suppose there is a DataFrame containing the categorical variable 'color'
df = pd.DataFrame({
    'color': ['red', 'green', 'blue', 'red', 'green']
})

# Use pandas' get_dummies function for one-hot encoding
df_onehot = pd.get_dummies(df, columns=['color'])

print(df_onehot)

The output will be a DataFrame similar to this:

   color_blue  color_green  color_red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            0          1
4           0            1          0

In this example, the color column is converted into three new columns: color_blue, color_green, and color_red, one per color value (get_dummies orders them alphabetically). In each row, exactly one of these columns is 1 and the rest are 0, indicating the color of that row in the original data. (Recent pandas versions return boolean True/False columns instead of 0/1.)
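
The scikit-learn equivalent is the OneHotEncoder class, which is fitted on training data and can be told how to treat categories it has never seen; a minimal sketch (in scikit-learn versions before 1.2 the parameter is sparse=False rather than sparse_output=False):

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})

# handle_unknown='ignore' encodes unseen categories as all-zero rows
encoder = OneHotEncoder(sparse_output=False, handle_unknown='ignore')
encoded = encoder.fit_transform(df[['color']])

print(encoder.get_feature_names_out())  # ['color_blue' 'color_green' 'color_red']
print(encoded)

# A category never seen during fit becomes an all-zero vector instead of raising an error
# (scikit-learn may emit a warning about the unknown category)
print(encoder.transform(pd.DataFrame({'color': ['purple']})))  # [[0. 0. 0.]]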

6. Label Encoding: Convert categorical variables to integers

Label Encoding is a simple method for converting categorical variables (also known as nominal variables) into integers. This approach is often used to convert unordered class labels (such as colors, weekday names, or textual gender values) into a numerical format that machine learning models can handle.

In label encoding, each unique class label is assigned a unique integer. For example, if there are three class labels "red", "green" and "blue", label encoding might convert them to the integers 0, 1, and 2 respectively (or any other integer mapping; the key is to keep the mapping consistent).

While label encoding can convert categorical variables into numerical values, it has an important limitation: it implies an ordinal relationship between categories, which is incorrect in many cases. For example, there is no natural order between the colors "red", "green" and "blue", but if label encoding is used, the model may read some order or hierarchy into the encoded integers.

Therefore, take special care when using label encoding to ensure that the model does not misinterpret the converted integers as ordered. If the categorical variable is ordered (such as the ratings "low", "medium", "high"), then label encoding can be appropriate.

In Python, you can use scikit-learn's LabelEncoder class for label encoding. Here is an example:

from sklearn.preprocessing import LabelEncoder

# Create a label encoder
le = LabelEncoder()

# Suppose there is a list of categorical values
categories = ['red', 'green', 'blue', 'red', 'green']

# Label-encode the categorical values
encoded_categories = le.fit_transform(categories)

print(encoded_categories)
# Output: [2 1 0 2 1] -- LabelEncoder sorts the labels, so 'blue' -> 0, 'green' -> 1, 'red' -> 2

# View the label-to-integer mapping through the classes_ attribute
print(le.classes_)
# Output: ['blue' 'green' 'red'] (always in sorted order)

Note that label encoding assigns integers according to the sorted order of the labels, so the same set of labels always produces the same mapping. However, if the test dataset contains a category that did not appear in the training dataset, the encoder's transform method will raise an error; LabelEncoder itself has no handle_unknown option (that parameter belongs to OneHotEncoder and OrdinalEncoder), so you need to predefine all possible categories or handle new ones yourself.

For unordered categorical variables, one-hot encoding is usually recommended instead of label encoding, to avoid introducing spurious ordinal relationships.
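
A pandas alternative worth knowing: converting the column to the category dtype and using its integer codes, which likewise assigns integers by sorted category order unless you specify the categories yourself; a minimal sketch:

import pandas as pd

s = pd.Series(['red', 'green', 'blue', 'red', 'green'])

# .cat.codes assigns integers by sorted category order: blue=0, green=1, red=2
print(s.astype('category').cat.codes.tolist())  # [2, 1, 0, 2, 1]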

7. Ordinal Encoding: convert ordered categorical variables into integers, retaining order information

Ordinal Encoding is an encoding method specifically for ordered categorical variables. Like label encoding, ordinal encoding converts class labels into integers. Unlike label encoding, however, it is used for categorical variables with a natural order, so the converted integers not only distinguish the categories but also preserve the order relationship between them.

For example, suppose we have an ordered categorical variable representing user satisfaction, with the categories "very dissatisfied", "dissatisfied", "neutral", "satisfied", and "very satisfied". There is a clear order between these categories, namely "very dissatisfied" < "dissatisfied" < "neutral" < "satisfied" < "very satisfied". In ordinal encoding, we can map these categories to integers in order, for example 0, 1, 2, 3, and 4.

In Python, ordinal encoding can be implemented with a custom mapping or with existing data preprocessing libraries. Here is a simple example of manual ordinal encoding:

# Categories of the ordered variable, in order
satisfaction_levels = ["Very dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very satisfied"]

# Custom ordinal encoding mapping
ordinal_mapping = {level: index for index, level in enumerate(satisfaction_levels)}

print(ordinal_mapping)
# Output: {'Very dissatisfied': 0, 'Dissatisfied': 1, 'Neutral': 2, 'Satisfied': 3, 'Very satisfied': 4}

# Sample data
data = ["Very dissatisfied", "Satisfied", "Neutral", "Very satisfied", "Dissatisfied"]

# Encode ordinally
encoded_data = [ordinal_mapping[level] for level in data]

print(encoded_data)  # Output: [0, 3, 2, 4, 1]

In this example, we create a mapping from satisfaction levels to integers and use it to transform the sample data. The converted integer list encoded_data retains the order information of the original satisfaction levels.

When using machine learning models, ordinal encoding makes sense if the categorical variable is ordered and the order information matters for prediction. However, if the model does not handle such ordinal relationships well (for example, some distance-based algorithms), other encodings such as one-hot encoding may need to be considered.
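
In pandas, the same mapping can be expressed with an ordered Categorical, which makes the intended order explicit; a minimal sketch:

import pandas as pd

levels = ["Very dissatisfied", "Dissatisfied", "Neutral", "Satisfied", "Very satisfied"]
data = ["Very dissatisfied", "Satisfied", "Neutral", "Very satisfied", "Dissatisfied"]

# The codes follow the order given in categories, not alphabetical order
cat = pd.Categorical(data, categories=levels, ordered=True)
print(list(cat.codes))  # [0, 3, 2, 4, 1]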

Built-in method

In scikit-learn, ordinal encoding does have a dedicated class: sklearn.preprocessing.OrdinalEncoder, whose categories parameter lets you specify the semantic order of the levels explicitly. You can also implement ordinal encoding with a simple mapping, as shown above, or press LabelEncoder into service, but the latter requires care.

LabelEncoder assigns each unique label an integer according to the sorted order of the labels, not the order in which they appear in the data. It therefore preserves the semantic order of an ordered variable only when the lexicographic order happens to coincide with it; for labels like "low", "medium", "high" it does not ("high" sorts first). In such cases, use a manual mapping dictionary, or OrdinalEncoder with an explicit categories list.

Here is an example of ordinal encoding with OrdinalEncoder:

from sklearn.preprocessing import OrdinalEncoder
import numpy as np

# Pass the categories of the ordered variable in their semantic order
encoder = OrdinalEncoder(categories=[["low", "medium", "high"]])

# OrdinalEncoder expects a 2D array of shape (n_samples, n_features)
data = np.array([["low"], ["medium"], ["high"]])
encoded_categories = encoder.fit_transform(data)

print(encoded_categories.ravel())  # Output: [0. 1. 2.]

# New data points can be transformed with the same fitted encoder
new_data = np.array([["medium"], ["high"], ["low"]])
print(encoder.transform(new_data).ravel())  # Output: [1. 2. 0.]

In this example, we pass the category order explicitly via the categories parameter, so the encoded integers preserve the intended order information regardless of how the labels would sort alphabetically or in which order they appear in the data.

If you use LabelEncoder instead, check its classes_ attribute to verify that the sorted label order matches the semantic order of your variable; if it does not, remap the encoded values manually or fall back to a mapping dictionary as shown earlier.

Note that LabelEncoder is primarily intended for encoding target labels, while OrdinalEncoder is designed for input features. If the library or framework you are using has its own ordinal encoding facilities, consult the relevant documentation for details; implementation methods and naming conventions differ between libraries.
