When training deep learning models, the dataset is usually divided into a training set and a validation set. Keras provides two ways to evaluate model performance:
Using an automatically split validation set
Using a manually split validation set
one. Automatic splitting
In Keras, a portion of the dataset can be held out as a validation set, and the model is evaluated on that validation set at the end of each epoch.
Specifically, when calling model.fit() to train the model, the validation_split parameter specifies the proportion of the dataset to hold out as the validation set.
# MLP with automatic validation set
from keras.models import Sequential
from keras.layers import Dense
import numpy
# fix random seed for reproducibility
numpy.random.seed(7)
# load pima indians dataset (supply the path to the CSV file)
dataset = numpy.loadtxt("", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model, holding out 33% of the data for validation
model.fit(X, Y, validation_split=0.33, epochs=150, batch_size=10)
validation_split: A floating point number between 0 and 1 that specifies the proportion of the training data to hold out as the validation set. The validation set does not participate in training; it is used to compute the model's metrics, such as the loss and accuracy, at the end of each epoch.
Note that the validation_split portion is taken before shuffling, from the end of the data. If your data is ordered, shuffle it manually before specifying validation_split; otherwise the samples in the validation set may be unrepresentative.
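The shuffling step mentioned above can be sketched with NumPy. This is a minimal illustration rather than part of the original example: the helper name shuffle_together and the toy arrays are made up here, and the only assumption is that X and Y are NumPy arrays with matching first dimensions.

```python
import numpy as np

def shuffle_together(X, Y, seed=7):
    """Return X and Y reordered by the same random permutation."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(X))
    return X[order], Y[order]

# Toy data sorted by label: all class 0 rows first, then all class 1 rows.
X = np.arange(16, dtype=float).reshape(8, 2)
Y = np.array([0, 0, 0, 0, 1, 1, 1, 1])

# Without shuffling, the last 25% of the rows (the part a
# validation_split of 0.25 would hold out) contains only class 1.
tail = Y[int(len(Y) * 0.75):]

# Shuffle features and labels with one permutation so rows stay paired.
X_shuffled, Y_shuffled = shuffle_together(X, Y)
```

The shuffled arrays can then be passed to model.fit() together with validation_split, so that the held-out tail is a random sample rather than the last rows of an ordered file.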
two. Manual splitting
Keras allows manual specification of validation sets when training the model.
For example, use the train_test_split() function from the sklearn library to split the dataset, and then pass the held-out part to model.fit() through the validation_data parameter.
# MLP with manual validation set
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import train_test_split
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset (supply the path to the CSV file)
dataset = numpy.loadtxt("", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# split into 67% for train and 33% for test
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=seed)
# create model
model = Sequential()
model.add(Dense(12, input_dim=8, activation='relu'))
model.add(Dense(8, activation='relu'))
model.add(Dense(1, activation='sigmoid'))
# Compile model
model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
# Fit the model, validating on the manually held-out split
model.fit(X_train, y_train, validation_data=(X_test,y_test), epochs=150, batch_size=10)
three. K-fold cross validation
The dataset is divided into k parts (folds). In each round, the model is trained on k-1 parts and validated on the remaining part. Repeating this for k rounds yields k models, and the average of the k scores is taken as the overall performance of the algorithm. k is typically 5 or 10.
Advantages: It gives a relatively robust estimate of how the model performs on unseen data.
Disadvantages: The computational cost is high, because the model is trained k times. It may therefore be impractical when the dataset is large, the model is complex, or computing resources are limited, which is often the case when training deep learning models.
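The k-fold procedure described above can be sketched in plain NumPy. This is only an illustration of the index bookkeeping, not sklearn's implementation: k_fold_indices and cross_validate are hypothetical helper names, and score_fn stands for whatever function trains a model on the training indices and returns a score on the validation indices.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=7):
    """Yield (train_idx, val_idx) pairs, one pair per fold."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    folds = np.array_split(order, k)  # k nearly equal parts
    for i in range(k):
        val_idx = folds[i]
        # Train on every fold except the i-th one.
        train_idx = np.concatenate([folds[j] for j in range(k) if j != i])
        yield train_idx, val_idx

def cross_validate(score_fn, n_samples, k=5):
    """Call score_fn(train_idx, val_idx) once per fold; return mean and std."""
    scores = [score_fn(tr, va) for tr, va in k_fold_indices(n_samples, k)]
    return float(np.mean(scores)), float(np.std(scores))

# 10 samples and 5 folds: each round trains on 8 samples and validates on 2.
splits = list(k_fold_indices(10, 5))
```

Every sample lands in the validation part exactly once across the k rounds, which is why the averaged score uses all of the data for evaluation.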
sklearn.model_selection provides KFold and RepeatedKFold, LeaveOneOut, LeavePOut, ShuffleSplit, StratifiedKFold, GroupKFold, TimeSeriesSplit and other variants.
StratifiedKFold, used in the following example, applies stratified sampling, which ensures that each fold has the same proportion of samples from each class as the original dataset.
# MLP for Pima Indians Dataset with 10-fold cross validation
from keras.models import Sequential
from keras.layers import Dense
from sklearn.model_selection import StratifiedKFold
import numpy
# fix random seed for reproducibility
seed = 7
numpy.random.seed(seed)
# load pima indians dataset (supply the path to the CSV file)
dataset = numpy.loadtxt("", delimiter=",")
# split into input (X) and output (Y) variables
X = dataset[:,0:8]
Y = dataset[:,8]
# define 10-fold cross validation test harness
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=seed)
cvscores = []
for train, test in kfold.split(X, Y):
    # create a fresh model for each fold
    model = Sequential()
    model.add(Dense(12, input_dim=8, activation='relu'))
    model.add(Dense(8, activation='relu'))
    model.add(Dense(1, activation='sigmoid'))
    # Compile model
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
    # Fit the model on the training folds
    model.fit(X[train], Y[train], epochs=150, batch_size=10, verbose=0)
    # evaluate the model on the held-out fold
    scores = model.evaluate(X[test], Y[test], verbose=0)
    print("%s: %.2f%%" % (model.metrics_names[1], scores[1]*100))
    cvscores.append(scores[1] * 100)
# report the mean and standard deviation of the fold accuracies
print("%.2f%% (+/- %.2f%%)" % (numpy.mean(cvscores), numpy.std(cvscores)))
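To see what stratified sampling means in practice, the sketch below assigns samples to folds class by class and then checks the class proportion in each fold. It is a hand-rolled illustration under simple assumptions (a hypothetical stratified_folds helper and made-up toy labels) and is not how StratifiedKFold is implemented internally.

```python
import numpy as np

def stratified_folds(y, k, seed=7):
    """Return an array giving the fold number (0..k-1) of every sample."""
    rng = np.random.default_rng(seed)
    fold_of = np.empty(len(y), dtype=int)
    for label in np.unique(y):
        # Shuffle the samples of this class, then deal them out
        # round-robin so each fold receives an equal share.
        idx = rng.permutation(np.flatnonzero(y == label))
        fold_of[idx] = np.arange(len(idx)) % k
    return fold_of

# Toy labels: 70 samples of class 0 and 30 samples of class 1.
y = np.array([0] * 70 + [1] * 30)
fold_of = stratified_folds(y, k=10)

# Fraction of class 1 in each fold; the full dataset has 30%.
positive_rate = [float(y[fold_of == f].mean()) for f in range(10)]
```

With these toy labels every fold receives 7 samples of class 0 and 3 of class 1, so each fold keeps the 30% share of class 1 found in the full dataset.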
Supplementary knowledge: Training set, validation set and test set
Training set: Used to fit the parameters of the model by minimizing the objective function (loss function plus regularization term). When the objective function is minimized, training of the model is complete.
Validation set: Used to select the order (complexity) of the model. The order whose trained model gives the smallest objective function value is the one finally selected.
Note:
1. The validation set is used repeatedly during the training process. In machine learning it serves as the criterion for choosing between different models, and in deep learning as the criterion for choosing the number of network layers and the number of nodes per layer.
2. Using a validation set is not essential. If the number of layers and nodes of the network has already been determined, this step is not needed.
Test set: Used to evaluate the generalization ability of the selected, fully trained model.
Note:
The test set measures the generalization ability of the final trained model, and it is used only once.
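The three roles described above can be sketched as a single index split. This is a minimal example under assumed proportions (60% training, 20% validation, 20% test); the function name three_way_split and the fractions are illustrative choices, not something prescribed by Keras or sklearn.

```python
import numpy as np

def three_way_split(n_samples, val_fraction=0.2, test_fraction=0.2, seed=7):
    """Return disjoint index arrays for the training, validation and test sets."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(n_samples)
    n_test = int(round(n_samples * test_fraction))
    n_val = int(round(n_samples * val_fraction))
    test_idx = order[:n_test]                  # touched once, at the very end
    val_idx = order[n_test:n_test + n_val]     # reused while selecting models
    train_idx = order[n_test + n_val:]         # used to fit the parameters
    return train_idx, val_idx, test_idx

train_idx, val_idx, test_idx = three_way_split(100)
```

The training and validation indices can be reused across many candidate models, while the test indices are set aside until one final model has been chosen.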