After building a regression model, we still need to verify that the model is appropriate, in other words, whether it truly captures the relationship between the dependent variable and the independent variable. The usual criterion for this check is the goodness of fit.
Goodness of fit is the degree to which the regression equation fits the observed data. The statistic used to measure it is the coefficient of determination, R^2, which takes values in the range [0, 1]. The closer R^2 is to 1, the better the regression equation fits the observations; conversely, the closer R^2 is to 0, the worse the fit.
There is no universally agreed threshold for how large R^2 must be before a model counts as accurate; a common rule of thumb is that a value greater than 0.8 is acceptable.
Formula for goodness-of-fit: R^2 = 1 - RSS/TSS
Note: RSS is the residual sum of squares; TSS is the total sum of squares.
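The formula can be sketched directly in a few lines of NumPy; the observations and predictions below are made-up numbers for illustration only:

```python
import numpy as np

# Hypothetical observed values and model predictions (made up for illustration)
y_actual = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.8, 5.1, 7.2, 8.9])

rss = np.sum((y_actual - y_pred) ** 2)           # residual sum of squares
tss = np.sum((y_actual - y_actual.mean()) ** 2)  # total sum of squares
r_square = 1 - rss / tss

print(round(r_square, 4))
```

The closer the predictions track the observations, the smaller RSS becomes relative to TSS, and the closer R^2 gets to 1.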
Before understanding the formula for goodness of fit, a few concepts need to be clear: the total sum of squares (TSS), the residual sum of squares (RSS), and the regression sum of squares (ESS).
I. Total sum of squares, residual sum of squares, regression sum of squares
Regression sum of squares (ESS), residual sum of squares (RSS), total sum of squares (TSS):
TSS (Total Sum of Squares) is the sum of squared deviations of the actual values from their mean; it represents the total variation of the dependent variable.
ESS (Explained Sum of Squares) is the sum of squared deviations of the predicted values from the mean of the actual values; it represents the part of the variation that the model accounts for.
RSS (Residual Sum of Squares) is the sum of squared deviations of the actual values from the predicted values; it represents the part of the variation the model leaves unexplained.
The formula for each sum of squares is given below, where y_i is the actual value, ŷ_i the predicted value, and ȳ the mean of the actual values:
TSS = Σ(y_i - ȳ)^2
ESS = Σ(ŷ_i - ȳ)^2
RSS = Σ(y_i - ŷ_i)^2
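The three sums, and the decomposition TSS = ESS + RSS that holds for an ordinary least-squares fit with an intercept, can be checked numerically. A minimal sketch using NumPy's `polyfit` on made-up data points:

```python
import numpy as np

# Toy data with a roughly linear trend (made up for illustration)
x = np.array([1.0, 2.0, 3.0, 4.0])
y = np.array([2.0, 4.1, 5.9, 8.0])

# Ordinary least-squares fit: polyfit(..., 1) returns (slope, intercept)
slope, intercept = np.polyfit(x, y, 1)
y_hat = slope * x + intercept

tss = np.sum((y - y.mean()) ** 2)      # total sum of squares
ess = np.sum((y_hat - y.mean()) ** 2)  # explained (regression) sum of squares
rss = np.sum((y - y_hat) ** 2)         # residual sum of squares

# For OLS with an intercept, the total variation splits exactly
print(np.isclose(tss, ess + rss))
```

Note that the decomposition is a property of least-squares fitting; for an arbitrary set of predictions it need not hold exactly.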
II. Goodness-of-fit
Continuing from the previous section: we take the sum of squared deviations of the actual values from their mean as the total variation of the dependent variable, and this variation is exactly what we are trying to reproduce when we build the model.
After modeling, the total variation (TSS) of the dependent variable can be split into two parts: the variation explained by the model (ESS) and the unexplained variation (RSS), i.e. TSS = ESS + RSS.
In general, the larger the share of the total variation that the model explains, the more accurate the model; when RSS = 0, the model reproduces the variation of the dependent variable completely.
Going back to the goodness-of-fit formula at the beginning of the article, R^2 = 1 - RSS/TSS, it is now easy to interpret.
Assuming R^2 = 0.8, the model captures 80% of the total variation, leaving the remaining 20% unexplained.
III. Examples
Suppose we want to explore whether there is a relationship between students' exam scores and a single variable, study time, given the two data series below:
'Study time': [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50, 2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50]
'Score': [10, 22, 13, 43, 20, 22, 33, 50, 62, 48, 55, 75, 62, 73, 81, 76, 64, 82, 90, 93]
Common sense says that the longer you study, the higher your score tends to be, so the two should be positively correlated. Since there is only one independent variable, we can use sklearn directly to compute the intercept and slope.
```python
import numpy as np
import matplotlib.pyplot as plt
from pandas import DataFrame
from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
from sklearn.linear_model import LinearRegression

# Create the data set
examDict = {'Study time': [0.50, 0.75, 1.00, 1.25, 1.50, 1.75, 1.75, 2.00, 2.25, 2.50,
                           2.75, 3.00, 3.25, 3.50, 4.00, 4.25, 4.50, 4.75, 5.00, 5.50],
            'Score': [10, 22, 13, 43, 20, 22, 33, 50, 62, 48,
                      55, 75, 62, 73, 81, 76, 64, 82, 90, 93]}

# Convert to DataFrame data format
examDf = DataFrame(examDict)

# Plot a scatter plot of the raw data
plt.scatter(examDf['Study time'], examDf['Score'], color='b', label="Exam Data")
# Add axis labels and display the image
plt.xlabel("Hours")
plt.ylabel("Score")
plt.show()

# Split the original data set into training and test sets;
# train_size is the proportion of the data used for training
exam_X = examDf['Study time']
exam_Y = examDf['Score']
X_train, X_test, Y_train, Y_test = train_test_split(exam_X, exam_Y, train_size=0.8)
print("Raw data features:", exam_X.shape,
      ", training data features:", X_train.shape,
      ", test data features:", X_test.shape)
print("Raw data labels:", exam_Y.shape,
      ", training data labels:", Y_train.shape,
      ", test data labels:", Y_test.shape)

model = LinearRegression()

# The model needs a 2D array of features to fit, but there is only one feature here,
# so reshape(-1, 1) converts each Series into a single-column 2D array
# (-1 lets NumPy infer the number of rows automatically).
X_train = X_train.values.reshape(-1, 1)
X_test = X_test.values.reshape(-1, 1)

model.fit(X_train, Y_train)
a = model.intercept_  # intercept
b = model.coef_       # regression coefficient (slope)
print("Line of best fit: intercept", a, ", regression coefficient:", b)
```
Next, compute the goodness of fit: it comes out to about 0.83, which meets the requirement.
```python
# Goodness of fit on the training set, to verify that the regression equation is reasonable
def get_lr_stats(x, y, model):
    message0 = ('The one-way linear regression equation is:\ty = '
                + str(model.intercept_) + ' + ' + str(model.coef_[0]) + ' * x')
    y_prd = model.predict(x)
    Regression = sum((y_prd - np.mean(y)) ** 2)  # regression sum of squares (ESS)
    Residual = sum((y - y_prd) ** 2)             # residual sum of squares (RSS)
    total = sum((y - np.mean(y)) ** 2)           # total sum of squares (TSS)
    R_square = 1 - Residual / total              # coefficient of determination R^2
    message1 = ('coefficient of determination (R^2): ' + str(R_square) + ';\n'
                + 'total sum of squares (TSS): ' + str(total) + ';\n')
    message2 = ('regression sum of squares (ESS): ' + str(Regression) + ';\n'
                + 'residual sum of squares (RSS): ' + str(Residual) + ';\n')
    return print(message0 + '\n' + message1 + message2)

get_lr_stats(X_train, Y_train, model)
```
If needed, you can also plot all the points together with the regression line to visualize the fit.
```python
# Plot the line of best fit, using the extreme x values of the training data as endpoints
X_train_pred = [min(X_train), max(X_train)]
y_train_pred = [a + b * min(X_train), a + b * max(X_train)]
plt.plot(X_train_pred, y_train_pred, color='green', linewidth=3, label="best line")

# Scatter plots of the test and training data
plt.scatter(X_test, Y_test, color='red', label="test data")
plt.scatter(X_train, Y_train, color="blue", label="train data")

# Add the legend and axis labels, then display the image
plt.legend(loc=2)
plt.xlabel("Hours")
plt.ylabel("Score")
plt.show()

# Compute the goodness of fit on the test set
score = model.score(X_test, Y_test)
print(score)
```
That is all I have to share about the goodness of fit as a test criterion for a Python linear regression model. I hope it serves as a useful reference, and thank you for your support.