SoFunction
Updated on 2025-03-02

Data splitting and cross-validation with sklearn and Keras: a detailed explanation with examples

When training a deep learning model, the dataset is usually divided into a training set and a validation set. Keras provides two ways to evaluate model performance:

Using an automatically split validation set

Using a manually split validation set

1. Automatic splitting

Keras can split off a portion of the dataset as a validation set and evaluate the model on that validation set at the end of each epoch.

Specifically, when calling model.fit() to train the model, the validation_split parameter specifies the fraction of the dataset to hold out as the validation set.

# MLP with automatic validation set
from keras.models import Sequential
from keras.layers import Dense
import numpy
# fix random seed for reproducibility
numpy.random.seed(7)
# load pima indians dataset (the file path is blank in the source; supply the path to your CSV)
dataset = numpy.loadtxt("", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model, holding out the last 33% of the rows for validation
model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10)

validation_split: A float between 0 and 1 that specifies the fraction of the training data to hold out as the validation set. The validation set does not take part in training; at the end of each epoch the model's metrics, such as loss and accuracy, are computed on it.

Note that the validation split is taken before shuffling (Keras uses the last samples of the arrays passed in). If your data is ordered, shuffle it manually before specifying validation_split; otherwise the validation set may not be representative of the whole dataset.
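As a minimal sketch of why the ordering matters, the snippet below uses a made-up label array sorted by class (not the Pima data) and mimics a "last fraction" split. The `tail_split` helper and the array sizes are assumptions for illustration only; this is not Keras's own code.

```python
import numpy as np

def tail_split(n_samples, validation_split):
    """Index where the validation tail starts, mimicking a 'last fraction' split."""
    return int(n_samples * (1.0 - validation_split))

# Hypothetical ordered data: 60 samples of class 0 followed by 40 of class 1.
# Each feature value equals the row's original position, so pairing can be checked.
X = np.arange(100, dtype=float).reshape(100, 1)
Y = np.array([0] * 60 + [1] * 40)

split_at = tail_split(len(X), 0.33)
ordered_val_labels = Y[split_at:]  # the tail of ordered data holds only class 1

# Shuffle X and Y with the same permutation so that rows stay paired.
rng = np.random.default_rng(7)
perm = rng.permutation(len(X))
X_shuffled, Y_shuffled = X[perm], Y[perm]
shuffled_val_labels = Y_shuffled[split_at:]

print("ordered  validation class counts:", np.bincount(ordered_val_labels, minlength=2))
print("shuffled validation class counts:", np.bincount(shuffled_val_labels, minlength=2))
```

The shuffled arrays can then be passed to model.fit() together with validation_split, so that the held-out tail contains a mix of both classes.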

2. Manual splitting

Keras also lets you specify the validation set manually when training the model.

For example, use the train_test_split() function from the sklearn library to split the dataset, then pass the held-out portion to model.fit() through the validation_data parameter.

# MLP with manual validation set
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset (the file path is blank in the source; supply the path to your CSV)
dataset = numpy.loadtxt("", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# split into 67% for train and 33% for test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=seed)
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model, validating on the held-out split after each epoch
model.fit(X_train, y_train, validation_data=(X_test,y_test), epochs=150, batch_size=10)

3. K-fold cross-validation

The dataset is divided into k parts (folds). In each round, the model is trained on k-1 folds and validated on the remaining fold. Repeating this for k rounds yields k models, and the average of the k scores is taken as the overall performance of the algorithm. k is typically 5 or 10.

Advantages: It gives a relatively robust estimate of the model's performance on unseen data.

Disadvantages: The computational cost is high, because k models must be trained. It may therefore be impractical when the dataset is large, the model is complex, or computing resources are limited, which is often the case when training deep learning models.
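The k-fold procedure described at the start of this section can be sketched with plain NumPy indices. Both `kfold_indices` and `dummy_score` are hypothetical helpers written for this illustration; `dummy_score` merely stands in for "train a model, then score it on the held-out fold".

```python
import numpy as np

def kfold_indices(n_samples, k):
    """Yield (train_idx, val_idx) pairs; each sample is validated exactly once."""
    folds = np.array_split(np.arange(n_samples), k)
    for i in range(k):
        val_idx = folds[i]
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def dummy_score(train_idx, val_idx):
    """Hypothetical stand-in for fitting on train_idx and scoring on val_idx."""
    return len(train_idx) / (len(train_idx) + len(val_idx))

splits = list(kfold_indices(n_samples=20, k=5))
scores = [dummy_score(train_idx, val_idx) for train_idx, val_idx in splits]
mean_score = float(np.mean(scores))  # the average over the k rounds
print("per-fold scores:", scores)
print("mean score:", mean_score)
```

In practice the sklearn splitters listed below take care of the index bookkeeping, including shuffling and stratification.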

sklearn.model_selection provides KFold along with variants such as RepeatedKFold, LeaveOneOut, LeavePOut, ShuffleSplit, StratifiedKFold, GroupKFold, and TimeSeriesSplit.

The StratifiedKFold class used in the following example performs stratified sampling, which keeps the proportion of each class in every fold approximately the same as in the original dataset.

# MLP for Pima Indians Dataset with 10-fold cross validation
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import StratifiedKFold
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset (the file path is blank in the source; supply the path to your CSV)
dataset = numpy.loadtxt("", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cvscores = []
for train, test in kfold.split(X, Y):
    # create a fresh model for every fold
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Fit the model on the training folds
    model.fit(X[train], Y[train], epochs=150, batch_size=10, verbose=0)
    # evaluate the model on the held-out fold
    scores = model.evaluate(X[test], Y[test], verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
# report mean and standard deviation of accuracy across the folds
print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))
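To check the stratification behavior independently of the network, here is a small sketch on made-up labels (70 negatives and 30 positives, chosen for illustration and not taken from the Pima data). It only inspects the class balance of each held-out fold.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Hypothetical imbalanced labels: 30% of the samples are positive.
y = np.array([0] * 70 + [1] * 30)
X = np.zeros((len(y), 1))  # feature values do not influence the split

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
fold_positive_rates = [float(y[test_idx].mean()) for _, test_idx in skf.split(X, y)]

print("overall positive rate:", float(y.mean()))
print("per-fold positive rates:", fold_positive_rates)
```

Because both class counts here are divisible by the number of folds, every fold reproduces the overall rate; with counts that do not divide evenly, the per-fold rates would only be approximately equal.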

Supplementary knowledge: training set, validation set, and test set

Training set: Used to fit the parameters of the model by minimizing the objective function (loss function + regularization term). Training is complete when the objective function has been minimized.

Validation set: Used to select the order of the model, that is, its complexity or hyperparameters. The order whose model yields the smallest objective function value on the validation set is the one finally selected.

Note:

1. The validation set is used repeatedly during the training process. In machine learning it serves as the criterion for choosing among different models; in deep learning it serves as the criterion for choosing the number of network layers and the number of nodes per layer.

2. A validation set is not strictly necessary. If the number of layers and nodes of the network has already been determined, this step can be skipped.

Test set: Used to evaluate the generalization ability of the final model, that is, the model that has already been selected and trained.

Note:

The test set measures the generalization ability of the final trained model, and it should be used for this only once.
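The three roles above can be sketched as a single shuffled index split. The `three_way_split` helper, the 60/20/20 proportions, and the seed are assumptions chosen for illustration rather than values from the original examples.

```python
import numpy as np

def three_way_split(n_samples, val_fraction, test_fraction, seed=0):
    """Return disjoint index arrays (train, val, test) cut from one permutation."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    n_val = int(round(n_samples * val_fraction))
    test_idx = perm[:n_test]                # set aside, used once at the very end
    val_idx = perm[n_test:n_test + n_val]   # used repeatedly for model selection
    train_idx = perm[n_test + n_val:]       # used to fit the parameters
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(
    n_samples=100, val_fraction=0.2, test_fraction=0.2
)
print(len(train_idx), len(val_idx), len(test_idx))
```

Cutting all three sets from one permutation guarantees that no sample appears in more than one of them.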
