This logistic regression template, compiled as a personal study note, is shared here for reference. The data set is the Australian credit data set, with 14 independent variables Xi and one dependent variable Y.
1. Splitting the data into training and test sets (70/30)
# Read the data set
australian <- read.table("", sep = ",", header = TRUE)   # fill in the path to the australian data file
# Number of rows (samples)
N <- length(australian$Y)
# ind = 1 marks a row with probability 0.7, ind = 2 with probability 0.3
ind <- sample(2, N, replace = TRUE, prob = c(0.7, 0.3))
# Generate the training set (70% of the original data) and the test set (30%)
aus_train <- australian[ind == 1, ]
aus_test  <- australian[ind == 2, ]
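If the 70/30 split should be reproducible between runs, the random seed can be fixed before sampling. A minimal sketch, assuming the australian data frame has been read as above; set.seed and the seed value 123 are illustrative additions, not part of the original template:

set.seed(123)   # any fixed seed makes the split reproducible
ind <- sample(2, nrow(australian), replace = TRUE, prob = c(0.7, 0.3))
aus_train <- australian[ind == 1, ]
aus_test  <- australian[ind == 2, ]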
2. Generate the model and export the results
# Fit the logistic model with the glm function, using the training set.
# family: each response distribution (exponential family) allows several link functions that relate the mean to the linear predictor.
# A commonly used choice is binomial(link = 'logit'): the response follows a binomial distribution and the link is the logit, i.e. logistic regression.
pre <- glm(Y ~ ., family = binomial(link = "logit"), data = aus_train)
summary(pre)
# True values of the test set
real <- aus_test$Y
# The predict function returns the model's predictions: the model object is pre, newdata is the test set,
# and type = 'response' returns predicted probabilities on the scale of the response variable.
predict. <- predict(pre, type = 'response', newdata = aus_test)
# Classify by the predicted probability of 1: > 0.5 returns 1, otherwise 0
predict <- ifelse(predict. > 0.5, 1, 0)
# Add a column of predicted values to the data
aus_test$predict <- predict
# Export the result in csv format
# write.csv(aus_test, "aus_test.csv")
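To make the logit link concrete, the predicted probabilities can also be recovered by hand from the linear predictor. A minimal sketch using the pre model and aus_test from above (eta and prob are illustrative names, not part of the original template):

eta  <- predict(pre, newdata = aus_test, type = "link")   # linear predictor b0 + b1*x1 + ...
prob <- 1 / (1 + exp(-eta))                               # inverse logit gives P(Y = 1)
all.equal(as.numeric(prob),
          as.numeric(predict(pre, newdata = aus_test, type = "response")))   # should be TRUE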
3. Model inspection
## Model checks
res <- data.frame(real, predict)   # collect the true and predicted values
# Number of rows of the training data, i.e. the number of samples
n <- nrow(aus_train)
# Cox-Snell goodness of fit
R2 <- 1 - exp((pre$deviance - pre$null.deviance) / n)
cat("Cox-Snell R2=", R2, "\n")
# Nagelkerke goodness of fit; this is the value reported at the end
R2 <- R2 / (1 - exp((-pre$null.deviance) / n))
cat("Nagelkerke R2=", R2, "\n")
## Other diagnostics of the model
# residuals(pre)     # residuals
# coefficients(pre)  # coefficients: the intercept and the slope of each independent variable, which give the linear predictor; also available as coef(pre)
# anova(pre)         # analysis of deviance table
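As a cross-check on the two formulas above, the same pseudo R-squared values can be computed from log-likelihoods. A minimal sketch, assuming pre and aus_train from above (ll_model, ll_null and the update() refit are illustrative additions, not part of the original template):

ll_model <- as.numeric(logLik(pre))                  # log-likelihood of the fitted model
ll_null  <- as.numeric(logLik(update(pre, . ~ 1)))   # log-likelihood of the intercept-only model
n        <- nrow(aus_train)
R2_cs    <- 1 - exp(2 * (ll_null - ll_model) / n)    # Cox-Snell R2
R2_nag   <- R2_cs / (1 - exp(2 * ll_null / n))       # Nagelkerke R2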
4. Accuracy, precision, recall and F-measure
true_value <- aus_test[, 15]
predict_value <- aus_test[, 16]
# Compute model accuracy: the proportion of correct predictions among all test samples
error <- predict_value - true_value
accuracy <- (nrow(aus_test) - sum(abs(error))) / nrow(aus_test)
# Compute Precision, Recall and F-measure
# Roughly speaking, Precision measures how many of the retrieved items (documents, web pages, ...) are relevant,
# while Recall measures how many of the relevant items are retrieved.
# In terms of the confusion matrix: Precision = TP / (TP + FP), the share of true positives among everything retrieved;
# Recall = TP / (TP + FN), the share of true positives among everything that should have been retrieved.
precision <- sum(true_value & predict_value) / sum(predict_value)   # truly 1 and predicted 1 / predicted 1
recall <- sum(predict_value & true_value) / sum(true_value)         # truly 1 and predicted 1 / truly 1
# Precision and Recall sometimes conflict, so they are considered together; the most common combination is the F-Measure (F-Score),
# a weighted harmonic mean of Precision and Recall that serves as a single summary metric.
F_measure <- 2 * precision * recall / (precision + recall)
# Output the above results
print(accuracy)
print(precision)
print(recall)
print(F_measure)
# Confusion matrix; with 0/1 coding the cells are TN, FP, FN and TP in sequence (rows = true value, columns = predicted value)
table(true_value, predict_value)
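The same quantities can be read directly off the confusion matrix, which makes the TP/FP/FN/TN definitions explicit. A minimal sketch using true_value and predict_value from above, assuming both classes 0 and 1 occur among the predictions (cm, TP, TN, FP, FN are illustrative names):

cm <- table(true_value, predict_value)   # rows = truth, columns = prediction
TP <- cm["1", "1"]; TN <- cm["0", "0"]
FP <- cm["0", "1"]; FN <- cm["1", "0"]
accuracy2  <- (TP + TN) / sum(cm)
precision2 <- TP / (TP + FP)
recall2    <- TP / (TP + FN)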
5. Several ways to draw the ROC curve
# ROC curve
# Method 1
# install.packages("ROCR")
library(ROCR)
pred <- prediction(predict., true_value)   # predicted probabilities (before the 0.5 cut-off) and true values
performance(pred, 'auc')@y.values          # AUC value
perf <- performance(pred, 'tpr', 'fpr')
plot(perf)
# Method 2
# install.packages("pROC")
library(pROC)
modelroc <- roc(true_value, predict.)
# Draw the ROC curve, mark the threshold coordinates and print the AUC value
# (the argument names below were incomplete in the original; these are common pROC plotting options)
plot(modelroc, print.auc = TRUE, auc.polygon = TRUE, legacy.axes = TRUE,
     grid = c(0.1, 0.2), grid.col = c("green", "red"), max.auc.polygon = TRUE,
     auc.polygon.col = "skyblue", print.thres = TRUE)
# Method 3: build the ROC curve from its definition
TPR <- rep(0, 1000)
FPR <- rep(0, 1000)
p <- predict.
for (i in 1:1000) {
  p0 <- i / 1000
  ypred <- 1 * (p > p0)
  TPR[i] <- sum(ypred * true_value) / sum(true_value)
  FPR[i] <- sum(ypred * (1 - true_value)) / sum(1 - true_value)
}
plot(FPR, TPR, type = "l", col = 2)
points(c(0, 1), c(0, 1), type = "l", lty = 2)
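For Method 3, the area under the hand-built curve can be approximated with the trapezoidal rule and compared with the AUC reported by ROCR or pROC. A minimal sketch using the FPR and TPR vectors from the loop above (auc_manual is an illustrative name; the value is only approximate because the thresholds range over 0.001 to 1):

ord <- order(FPR, TPR)   # sort the points by FPR before integrating
auc_manual <- sum(diff(FPR[ord]) * (head(TPR[ord], -1) + tail(TPR[ord], -1)) / 2)
auc_manual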
6. Change how the test and training sets are selected: ten-fold cross-validation
australian <- read.table("", sep = ",", header = TRUE)   # fill in the path to the australian data file
# Randomly divide the australian data into ten folds
# install.packages("caret")
set.seed(7)   # fix the grouping produced by createFolds
require(caret)
folds <- createFolds(y = australian$Y, k = 10)
# Loop over the folds to obtain the test-set and training-set accuracy of each of the 10 cross-validation runs
max <- 0
num <- 0
for (i in 1:10) {
  fold_test <- australian[folds[[i]], ]    # fold i as the test set
  fold_train <- australian[-folds[[i]], ]  # the remaining data as the training set
  print("***Group Number***")
  fold_pre <- glm(Y ~ ., family = binomial(link = 'logit'), data = fold_train)
  fold_predict <- predict(fold_pre, type = 'response', newdata = fold_test)
  fold_predict <- ifelse(fold_predict > 0.5, 1, 0)
  fold_test$predict <- fold_predict
  fold_error <- fold_test[, 16] - fold_test[, 15]
  fold_accuracy <- (nrow(fold_test) - sum(abs(fold_error))) / nrow(fold_test)
  print(i)
  print("***Test Set Accuracy***")
  print(fold_accuracy)
  print("***Training Set Accuracy***")
  fold_predict2 <- predict(fold_pre, type = 'response', newdata = fold_train)
  fold_predict2 <- ifelse(fold_predict2 > 0.5, 1, 0)
  fold_train$predict <- fold_predict2
  fold_error2 <- fold_train[, 16] - fold_train[, 15]
  fold_accuracy2 <- (nrow(fold_train) - sum(abs(fold_error2))) / nrow(fold_train)
  print(fold_accuracy2)
  if (fold_accuracy > max) {
    max <- fold_accuracy
    num <- i
  }
}
print(max)
print(num)
## From the results, the test-set accuracy is largest in run num, so folds[[num]] is taken as the test set and the remaining folds as the training set.
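Choosing the single best fold is what this template does, but the usual summary of k-fold cross-validation is the mean accuracy over all folds. A minimal sketch reusing the folds list from above (cv_acc is an illustrative name, not part of the original template):

cv_acc <- sapply(1:10, function(i) {
  test_i  <- australian[folds[[i]], ]
  train_i <- australian[-folds[[i]], ]
  fit_i   <- glm(Y ~ ., family = binomial(link = 'logit'), data = train_i)
  pred_i  <- ifelse(predict(fit_i, newdata = test_i, type = 'response') > 0.5, 1, 0)
  mean(pred_i == test_i$Y)   # accuracy on fold i
})
mean(cv_acc)                 # ten-fold cross-validated accuracy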
7. Obtain the accuracy for the best fold of the ten-fold cross-validation and export the results
# Results for the fold with the maximum test-set accuracy
testi <- australian[folds[[num]], ]
traini <- australian[-folds[[num]], ]   # the remaining folds as the training set
prei <- glm(Y ~ ., family = binomial(link = 'logit'), data = traini)
predicti <- predict(prei, type = 'response', newdata = testi)
predicti <- ifelse(predicti > 0.5, 1, 0)
testi$predict <- predicti
# write.csv(testi, "ausfold_test.csv")
errori <- testi[, 16] - testi[, 15]
accuracyi <- (nrow(testi) - sum(abs(errori))) / nrow(testi)
# Training-set accuracy for the same fold
predicti2 <- predict(prei, type = 'response', newdata = traini)
predicti2 <- ifelse(predicti2 > 0.5, 1, 0)
traini$predict <- predicti2
errori2 <- traini[, 16] - traini[, 15]
accuracyi2 <- (nrow(traini) - sum(abs(errori2))) / nrow(traini)
# Test-set accuracy, fold number, and training-set accuracy
accuracyi; num; accuracyi2
# write.csv(traini, "ausfold_train.csv")
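If the exports are wanted, the commented-out lines above correspond to write.csv; the fitted model itself can also be saved for later reuse. A minimal sketch (the .rds file name is just an example, not part of the original template):

write.csv(testi,  "ausfold_test.csv",  row.names = FALSE)   # test set with the predict column
write.csv(traini, "ausfold_train.csv", row.names = FALSE)   # training set with the predict column
saveRDS(prei, "ausfold_model.rds")                          # fitted glm object; reload later with readRDS()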
Summary
This concludes the article on logistic regression, ROC curves, and ten-fold cross-validation in R. For more on logistic regression in R, please search my previous articles or continue browsing the related articles below. I hope you will keep supporting me!