In data mining we often run into questions such as: which classifier should we choose — SVM, Decision Tree, or KNN? And how do we tune a classifier's parameters to get better classification accuracy?
Random Forest Classifier
Random Forest is abbreviated as RF. It is a classifier made up of multiple decision trees, where each sub-classifier is a CART (classification and regression) tree, so a random forest can do both classification and regression (a short sketch of both modes follows the two cases below).
- For classification, the output is the class predicted by the most sub-classifiers: you can think of each tree as casting a vote, and the class with the most votes is the result.
- For regression, the output is the average of the regression results of the individual CART trees.
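As a minimal sketch of the two modes (the toy data below is invented purely for illustration):

```python
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor

# Toy data invented for illustration: 4 samples, 2 features
X = [[0, 0], [1, 1], [0, 1], [1, 0]]
y_class = [0, 1, 1, 0]        # class labels -> majority vote across trees
y_reg = [0.1, 0.9, 0.8, 0.2]  # continuous targets -> average across trees

clf = RandomForestClassifier(n_estimators=10).fit(X, y_class)
reg = RandomForestRegressor(n_estimators=10).fit(X, y_reg)

print(clf.predict([[0.9, 0.9]]))  # the most-voted class among the trees
print(reg.predict([[0.9, 0.9]]))  # the mean of the trees' regression outputs
```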
In sklearn, we use RandomForestClassifier() to construct a random forest model. Some common constructor parameters are listed below (a short usage sketch follows the list):
- n_estimators: the number of decision trees in the forest; the default is 10 (raised to 100 in scikit-learn 0.22 and later).
- criterion: the splitting criterion for the decision trees; the default is the Gini index (the CART algorithm), but entropy (the ID3 algorithm) can also be chosen.
- max_depth: the maximum depth of each decision tree; the default is None, i.e. unlimited.
- n_jobs: the number of CPU cores used for fitting and prediction; the default is 1.
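Here is a short sketch combining the parameters above; the values are illustrative choices, not recommendations:

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative values for the constructor parameters listed above
rf = RandomForestClassifier(
    n_estimators=100,  # number of trees in the forest
    criterion='gini',  # splitting criterion ('gini' or 'entropy')
    max_depth=None,    # grow each tree until its leaves are pure
    n_jobs=-1,         # use all available CPU cores
)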
Tuning Model Parameters with GridSearchCV
When using a classification algorithm, we usually need to tune its hyperparameters (the constructor parameters above) to get better classification results. But a classification algorithm can have many parameters, each with a wide range of values, so how do we tune them?
sklearn gives us a very useful tool, GridSearchCV, an automatic parameter-search module that finds the optimal parameters for us.
We construct the automatic parameter searcher with GridSearchCV(estimator, param_grid, cv=None, scoring=None). There are some key arguments to note here (a combined sketch follows the list):
- estimator: the classifier to use, e.g. random forest, decision tree, SVM, or KNN.
- param_grid: the parameters to optimize and the values to try for each.
- cv: the number of cross-validation folds; the default of None means 3-fold cross validation (5-fold in scikit-learn 0.22 and later).
- scoring: the evaluation metric; the default of None means the estimator's own score method is used.
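Putting those four arguments together, a minimal sketch might look like this; the parameter grid, cv=5, and scoring='accuracy' are illustrative choices of ours, not defaults or requirements:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Illustrative: search two hyperparameters with explicit cv and scoring
param_grid = {
    'n_estimators': [10, 50, 100],
    'max_depth': [None, 5, 10],
}
clf = GridSearchCV(
    estimator=RandomForestClassifier(),
    param_grid=param_grid,
    cv=5,                # 5-fold cross validation
    scoring='accuracy',  # rank candidates by accuracy
)
```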
For example, let's use the IRIS dataset that ships with sklearn and classify it with a random forest. If we want to know which value of n_estimators in the range 1-10 gives the best classification result, we can write the following code:
```python
# -*- coding: utf-8 -*-
# Classify the IRIS dataset with RandomForest
# Find the optimal parameters with GridSearchCV
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

rf = RandomForestClassifier()
parameters = {"n_estimators": range(1, 11)}
iris = load_iris()
# Parameter tuning with GridSearchCV
clf = GridSearchCV(estimator=rf, param_grid=parameters)
# Classify the iris dataset
clf.fit(iris.data, iris.target)
print("Optimal score: %.4lf" % clf.best_score_)
print("Optimal parameters:", clf.best_params_)
```
Run results:
Optimal score: 0.9600
Optimal parameters: {'n_estimators': 3}
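As a side note, the fitted GridSearchCV object can also be used directly for prediction, since by default it refits the best model on the full dataset. A brief sketch, assuming clf was fit as in the example above (the sample values are made up):

```python
# clf is the fitted GridSearchCV from the example above;
# predict() delegates to the best estimator found by the search
sample = [[5.1, 3.5, 1.4, 0.2]]  # illustrative iris measurements
print(clf.predict(sample))

# the refit best model is also available directly
best_rf = clf.best_estimator_
```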
Pipelining with Pipeline
Classifying data often involves multiple steps, such as first normalizing the data, then reducing its dimensionality with PCA, and finally running a classifier on it.
sklearn offers a Pipeline mechanism for this: we create a pipeline by listing each step in order, where each step is written as a ('name', step) tuple.
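One convention worth noting before the full example: when a Pipeline is handed to GridSearchCV, a step's parameter is addressed as '<step name>__<parameter>' (double underscore). A minimal sketch, with step names of our own choosing:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

pipe = Pipeline([
    ('scaler', StandardScaler()),      # step named 'scaler'
    ('rf', RandomForestClassifier()),  # step named 'rf'
])
# GridSearchCV reaches a step's parameter via '<step name>__<parameter>'
params = {'rf__n_estimators': range(1, 11)}
```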
Let's now use the Pipeline mechanism to classify the IRIS dataset with a random forest. First we normalize the data with StandardScaler, i.e. rescale it to a distribution with mean 0 and variance 1; then we reduce its dimensionality with PCA; and finally we classify it with a random forest. The code is as follows:
```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

rf = RandomForestClassifier()
parameters = {"randomforestclassifier__n_estimators": range(1, 11)}
iris = load_iris()
pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA()),
    ('randomforestclassifier', rf)
])
clf = GridSearchCV(estimator=pipeline, param_grid=parameters)
clf.fit(iris.data, iris.target)
print("Optimal score: %.4lf" % clf.best_score_)
print("Optimal parameters:", clf.best_params_)
```
The results of the run are as follows:
Optimal score: 0.9600
Optimal parameters: {'randomforestclassifier__n_estimators': 9}
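If you want to inspect the winning model, the refit pipeline is available as best_estimator_. A brief sketch, assuming clf was fit as in the pipeline example above:

```python
# clf is the fitted GridSearchCV from the pipeline example above
best_pipe = clf.best_estimator_  # the refit Pipeline
# individual steps can be reached by name, e.g. the fitted PCA
print(best_pipe.named_steps['pca'].explained_variance_ratio_)
```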
This concludes our walkthrough of sample code for the Random Forest algorithm in Python, covering RandomForestClassifier, parameter tuning with GridSearchCV, and multi-step workflows with Pipeline.