Preface
scikit-learn is one of the most popular machine learning libraries in Python. It provides a wide variety of machine learning algorithms and tools, covering classification, regression, clustering, dimensionality reduction, and more.
The advantages of scikit-learn are:
- Simple and easy to use: the scikit-learn interface is straightforward and easy to understand, so users can get started with machine learning quickly.
- Unified API: the scikit-learn API is highly consistent, and the various algorithms are used in essentially the same way, which makes them easier to learn and use (see the short sketch after this list).
- A large number of algorithms already implemented: scikit-learn implements the classic machine learning algorithms and provides a rich set of tools and functions, which makes debugging and tuning models easier.
- Open source and free: scikit-learn is completely open source and free, and anyone can use and modify its code.
- Efficient and stable: scikit-learn implements a variety of efficient machine learning algorithms that can handle large-scale datasets, and it performs well in terms of stability and reliability.

scikit-learn is a very good entry point into machine learning because its API is unified and its models are relatively simple. My recommendation is to learn alongside the official documentation, which describes the scope of application of each model and includes code examples.
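To illustrate the unified API point, here is a minimal sketch of my own (synthetic data, not from this article's example): two different regressors are trained and scored through exactly the same fit and score calls.

# Minimal sketch of the unified API: every estimator is used through the same
# fit / predict / score methods (synthetic data, for illustration only).
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(100, 1) * 10                # 100 samples, 1 feature
y = 3 * X.ravel() + rng.randn(100)       # roughly linear target with noise

for est in (LinearRegression(), DecisionTreeRegressor(max_depth=3)):
    est.fit(X, y)                                        # identical training call
    print(type(est).__name__, 'R2:', est.score(X, y))    # identical scoring call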
scikit-learn official website: https://scikit-learn.org/
Linear Regression Model: LinearRegression
The LinearRegression model is a linear regression model suited to predicting continuous variables. Its basic idea is to set up a linear equation that models the relationship between the independent variable and the dependent variable as a straight line, fit that line with the training data to find the coefficients of the equation, and then use the equation to make predictions on the test data.
The LinearRegression model suits problems where there is a linear relationship between the independent and dependent variables, such as house price prediction, sales forecasting, and user behavior prediction. When the relationship is nonlinear, LinearRegression performs poorly; in that case, methods such as polynomial regression, ridge regression, and Lasso regression can be used instead.
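As a hedged sketch of the polynomial-regression option just mentioned (synthetic data, for illustration only; ridge and Lasso are available as sklearn.linear_model.Ridge and Lasso), PolynomialFeatures can be combined with LinearRegression in a pipeline:

# Sketch: polynomial regression for a nonlinear relationship (synthetic data).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = np.linspace(0, 5, 50).reshape(-1, 1)
y = 0.5 * X.ravel() ** 2 + rng.randn(50) * 0.5    # quadratic target with noise

# degree=2 expands x into [1, x, x^2]; LinearRegression then fits the coefficients
poly_model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
poly_model.fit(X, y)
print('R2 on the training data:', poly_model.score(X, y))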
Prepare the dataset
Setting other factors aside, there is a roughly linear relationship between study time and academic performance; study time here means effective study time, and as it increases, scores tend to increase as well. So we prepare a dataset of study time and scores. The first rows of the data look like this:
Study time,Score
0.5,15
0.75,23
1.0,14
1.25,42
1.5,21
1.75,28
1.75,35
2.0,51
2.25,61
2.5,49
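If you want to reproduce the example, the following sketch writes these rows to data/study_time_score.csv with pandas. Note that this is only the sample shown here; judging by the prediction output later in the article, the full file contains more rows. The column names must match the ones used when the file is read below.

# Sketch: save the sample rows shown above as data/study_time_score.csv
import os
import pandas as pd

rows = {
    'Study time': [0.5, 0.75, 1.0, 1.25, 1.5, 1.75, 1.75, 2.0, 2.25, 2.5],
    'Score':      [15, 23, 14, 42, 21, 28, 35, 51, 61, 49],
}
os.makedirs('data', exist_ok=True)
pd.DataFrame(rows).to_csv('data/study_time_score.csv', index=False)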
Using LinearRegression
Identify the features and the target
Between study time and scores, study time is the feature, i.e. the independent variable, and the score is the label, i.e. the dependent variable. So we need to extract the features and labels from the prepared study time and score dataset.
import pandas as pd
import numpy as np
from sklearn.metrics import r2_score, mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression

# Read the CSV data file of study time and scores
data = pd.read_csv('data/study_time_score.csv')
# Extract the feature: study time
X = data['Study time']
# Extract the target (label): score
Y = data['Score']
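Before modeling, a quick look at the loaded data (an optional check, not in the original code) confirms that the columns were read as expected:

# Optional: inspect the loaded data before modeling
print(data.head())     # first rows: study time and score
print(data.shape)      # (number of samples, number of columns)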
Dividing training sets and test sets
Once the feature and label data are ready, split the dataset into a training set and a test set, and then train scikit-learn's LinearRegression on the training portion.
""" Dividing feature data and target data into test sets and training sets passtest_size=0.25Divide 25% of the data into test sets """ X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.25, random_state=0) x_train = X_train.(-1, 1) (x_train, Y_train)
Select the model and fit the data
After preparing the training set and test set, we can select an appropriate model and fit it to the training data, so that we can then predict the targets corresponding to other feature values.
# Select the model: LinearRegression
model = LinearRegression()
# In scikit-learn, the feature input to a model must be a two-dimensional array,
# so convert the one-dimensional training features into an (n_samples, 1) array.
x_train = X_train.values.reshape(-1, 1)
# Fit the model on the training data
model.fit(x_train, Y_train)
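As a side note on the reshape step, an alternative I sometimes prefer (not from the original article) is to select the feature with a double-bracket DataFrame lookup, which is already two-dimensional and therefore needs no reshape. It reuses the data and Y variables defined above:

# Alternative sketch: a DataFrame selection keeps the feature two-dimensional,
# so no reshape is needed before fitting.
X_2d = data[['Study time']]               # shape (n_samples, 1)
X_tr, X_te, Y_tr, Y_te = train_test_split(X_2d, Y, test_size=0.25, random_state=0)
alt_model = LinearRegression().fit(X_tr, Y_tr)
print('R2 with the DataFrame feature:', alt_model.score(X_te, Y_te))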
Get the model parameters
Since the dataset contains only two columns, study time and score, this is a very simple linear model. The underlying formula is y = ax + b, where the dependent variable y is the score and the independent variable x is the study time.
""" Output model key parameters Intercept: Intercept, i.e. b Coefficients: variable weights, i.e. a """ print('Intercept:', model.intercept_) print('Coefficients:', model.coef_)
Validate on the test set
The fitting above used only the training data. Next, we use the test set to validate how well the model fits. After fitting on the training set, we can make predictions for the test features; by comparing the predictions with the actual target values, we can measure the fit of the model.
# Convert the test features to a 2D array of n rows and 1 column
x_test = X_test.values.reshape(-1, 1)
# Make predictions on the test set
Y_pred = model.predict(x_test)
# Print the test feature data
print(x_test)
# Print the predictions corresponding to the test features
print(Y_pred)
# Compare the predictions with the actual target values to measure the model fit.
# R2 (R-squared): the goodness of fit of the model, ranging from 0 to 1;
# the closer to 1, the better the model fits the data.
print("R2:", r2_score(Y_test, Y_pred))
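The mean_squared_error function imported at the top is not used in the original code; as an optional addition, it reports the average squared error, and its square root is in the same units as the score:

# Optional: report the error using the mean_squared_error import from above
mse = mean_squared_error(Y_test, Y_pred)
print('MSE:', mse)
print('RMSE:', np.sqrt(mse))    # root mean squared error, in score units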
Program running results

From the code above, we want to determine how well the LinearRegression model fits, that is, whether this data is suitable for a linear model. The program output is as follows:
Prediction results:
[47.43726068 33.05457106 49.83437561 63.41802692 41.84399249 37.84880093
23.46611131 37.84880093 26.66226456 71.40841004 18.67188144 88.9872529
63.41802692 42.6430308 21.86803469 69.81033341 66.61418017 33.05457106
58.62379705 50.63341392 18.67188144 41.04495418 20.26995807 77.80071653
28.26034119 13.87765157 61.81995029 90.58532953 77.80071653 36.25072431
84.19302303]
R2: 0.8935675710322939
Summary
The R2 of the model above is about 0.89, so if you can accept an error of roughly 10%, the LinearRegression model can be used for prediction. With a different split, for example when less of the data is used for training, the fit comes out slightly below 0.89. Because both the size of the dataset and the size of the training set affect how well the model fits, you may need to experiment with the split to find the setting that fits best (see the sketch below).
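To see how the split affects the fit, here is a short experiment of my own (not from the original article) that refits the model for several test_size values and prints the test-set R2 for each:

# Sketch: refit the model for several test_size values and compare test-set R2
for size in (0.1, 0.25, 0.4):
    Xa, Xb, Ya, Yb = train_test_split(X, Y, test_size=size, random_state=0)
    m = LinearRegression().fit(Xa.values.reshape(-1, 1), Ya)
    pred = m.predict(Xb.values.reshape(-1, 1))
    print(f'test_size={size}: R2={r2_score(Yb, pred):.3f}')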
This is the end of this detailed introduction to the Python machine learning library scikit-learn and its LinearRegression model. I hope it helps you get started with scikit-learn.