SoFunction
Updated on 2024-10-28

Python machine learning basics: logistic regression and unsupervised learning

I. Logistic regression

1. Model saving and loading

After a model is trained, it can be saved directly, which requires the joblib library. The dump method saves it in binary .pkl format, and the load method reads it back.

Install joblib: conda install joblib

Save: joblib.dump(rf, '')

Load: estimator = joblib.load('model path')

After loading, predictions can be made by feeding the test set to the estimator directly.
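A minimal sketch of this save-and-load workflow. The model, the training data, and the file name `model.pkl` are illustrative choices, not from the original article:

```python
import joblib
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression

# train any estimator (an illustrative model on sklearn's built-in data)
X, y = load_breast_cancer(return_X_y=True)
lr = LogisticRegression(max_iter=10000).fit(X, y)

joblib.dump(lr, "model.pkl")          # save in binary .pkl format
estimator = joblib.load("model.pkl")  # load it back

# the loaded estimator predicts directly on new data
print(estimator.predict(X[:5]))
```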

2. Logistic regression principle

Logistic regression is a classification algorithm. Its linear input h(x) is transformed by a sigmoid function, which maps any value of h(x) to a number between 0 and 1; we interpret this output as a probability. Given a threshold, say 0.5, anything greater than 0.5 is classified as 1, otherwise as 0. Logistic regression applies to binary classification problems.

① Inputs for logistic regression

The input is still a linear regression model, z = w1·x1 + w2·x2 + … + b, with weights w and feature values x. Our goal is still to find the most appropriate w.

② sigmoid function

The sigmoid function is an S-shaped curve whose formula is:

sigmoid(z) = 1 / (1 + e^(-z))

Here z is the result of the regression, h(x). After the sigmoid transformation the output lies between 0 and 1 no matter what value z takes. We then need to choose the most appropriate weights w so that the output probability is as close as possible to the target values of the training set. Therefore logistic regression also has a loss function, called the log-likelihood loss function; minimizing it yields the objective w.
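The transformation can be sketched in a few lines (the function name `sigmoid` is our own, for illustration):

```python
import numpy as np

def sigmoid(z):
    """Map any real-valued regression output z into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))

# whatever value z takes, the output stays between 0 and 1
print(sigmoid(np.array([-10.0, 0.0, 10.0])))
# with a 0.5 threshold, outputs above 0.5 are classified as 1
```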

③ Loss function for logistic regression

For a single sample the loss is -log(h(x)) when the true value y = 1, and -log(1 - h(x)) when y = 0.

From the shapes of these two curves it can be seen that when the true class is 1, the closer the output h(x) is to 1, the smaller the loss, and vice versa; the same holds when y = 0. Accordingly, minimizing the loss function yields our goal.
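This behavior can be checked numerically; the helper below is an illustrative one-sample version (the name `log_loss_single` is ours, not sklearn's API):

```python
import numpy as np

def log_loss_single(h, y):
    """Log-likelihood loss for one sample: -log(h) if y == 1, -log(1 - h) if y == 0."""
    return -np.log(h) if y == 1 else -np.log(1 - h)

# when the true class is 1, predictions closer to 1 cost less
print(log_loss_single(0.9, 1))  # small loss
print(log_loss_single(0.1, 1))  # large loss
```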

④ Logistic regression characteristics

Logistic regression is also solved by gradient descent. The mean square error has only one minimum and no local minima; for the log-likelihood loss, however, there may be multiple local minima, and no method completely solves the local-minimum problem. We can only try to avoid it by random initialization several times and by adjusting the learning rate. Even if the final result is a local optimum, it is still a good model.

3. Logistic regression API

sklearn.linear_model.LogisticRegression(penalty='l2', C=1.0)

where penalty is the regularization method and C is the inverse of the regularization strength (smaller values of C mean a stronger penalty).
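A minimal sketch with the defaults written out explicitly:

```python
from sklearn.linear_model import LogisticRegression

# penalty is the regularization method; C is the inverse of the
# regularization strength (smaller C = stronger penalty)
lr = LogisticRegression(penalty="l2", C=1.0)
print(lr)
```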

4. Logistic regression case

① Overview of the case

In the given data, a combination of multiple features determines whether a tumor is malignant or benign.

② Specific Processes

Since the overall flow of the algorithm is basically the same as before, with the focus on processing the data and features, we will not elaborate here.

Attention:

In this dataset the target values are 2 and 4 instead of 0 and 1, but they do not need to be preprocessed: the algorithm automatically relabels them as 0 and 1.

After the algorithm has made its predictions, if you want to see the recall, be careful to give names to the categories being classified, and the label values must be specified before the names (see the figure above); otherwise the method does not know which is benign and which is malignant. The order of the labels must correspond to the order of the names.

Generally speaking, the category with fewer samples is the one we judge against: for example, if malignant samples are fewer, then "what is the probability that a sample is malignant" is the question used to make the judgment.
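For a self-contained sketch of the flow, the example below uses sklearn's built-in copy of the Wisconsin breast cancer data (its targets are already 0/1, unlike the 2/4 labels of the UCI file the article describes). The variable names and the 0.25 test split are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report

# sklearn's built-in copy: target 0 = malignant, 1 = benign
X, y = load_breast_cancer(return_X_y=True)
x_train, x_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1)

# standardize the features before fitting
std = StandardScaler()
x_train = std.fit_transform(x_train)
x_test = std.transform(x_test)

lr = LogisticRegression()
lr.fit(x_train, y_train)
print("accuracy:", lr.score(x_test, y_test))

# labels and target_names must correspond in order, as noted above
print(classification_report(y_test, lr.predict(x_test),
                            labels=[0, 1],
                            target_names=["malignant", "benign"]))
```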

5. Logistic regression summary

Applications: binary classification problems, such as predicting ad click-through rates or whether a patient is diseased

Advantage: suitable for scenarios where you need to get a classification probability

Disadvantage: logistic regression performance is not very good when the feature space is large (depends on hardware capabilities)

II. Unsupervised learning

Unsupervised learning means that no correct answers are given: the data contain only feature values and no target values.

1. Principle of the K-means clustering algorithm

Assuming that the number of clusters is 3, the process is as follows:

① Three samples are randomly drawn from the data as the three centroids of the category

② Calculate the distance of every remaining point from each of the three centroids, and assign each point to its nearest centroid, forming three groups

③ Calculate the mean of each of the three clusters and compare the three means with the three previous centroids. If they are the same, end the clustering; if they are different, use the three means as the new cluster centroids and repeat step ②.
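The three steps above can be sketched with NumPy alone (an illustrative toy implementation, not sklearn's):

```python
import numpy as np

def kmeans(data, k=3, seed=0, max_iter=100):
    """Minimal k-means following the three steps above."""
    rng = np.random.default_rng(seed)
    # ① draw k samples at random as the initial centroids
    centroids = data[rng.choice(len(data), size=k, replace=False)]
    for _ in range(max_iter):
        # ② assign every point to its nearest centroid
        dists = np.linalg.norm(data[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # ③ recompute each cluster mean; stop when the centroids no longer move
        new_centroids = np.array([
            data[labels == i].mean(axis=0) if np.any(labels == i) else centroids[i]
            for i in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# three well-separated blobs of toy data
rng = np.random.default_rng(42)
data = np.vstack([rng.normal(c, 0.5, size=(30, 2))
                  for c in ([0, 0], [10, 10], [20, 0])])
labels, centroids = kmeans(data, k=3)
```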

2. K-means API

sklearn.cluster.KMeans(n_clusters=3)

Usually, clustering is done before classification: the samples are first clustered and labeled, and when a new sample arrives it can be classified according to the criteria given by the clustering.
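A minimal sketch with sklearn's KMeans on toy data (the blob data and parameter values are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans

# three well-separated blobs of toy data
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(c, 0.5, size=(50, 2))
                  for c in ([0, 0], [5, 5], [10, 0])])

km = KMeans(n_clusters=3, n_init=10, random_state=0)  # n_clusters = number of categories
labels = km.fit_predict(data)                         # cluster label for each sample
print(km.cluster_centers_)                            # the final centroids

# a new sample can then be classified by the learned cluster centers
print(km.predict([[0.2, -0.1]]))
```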

3. Clustering performance evaluation

① Performance evaluation principle

Simply put, it compares each point's distance to the "points within its own cluster" with its distance to the "points outside the cluster". The closer it is to the points inside its cluster, and the farther from the points outside it, the better.

For each sample i, let a_i be its average distance to the other points in its own cluster, and b_i its average distance to the points of the nearest other cluster. The silhouette coefficient is sc_i = (b_i - a_i) / max(a_i, b_i).

If sc_i is less than 0, the average intra-cluster distance a_i is greater than the distance b_i to the nearest other cluster, and the clustering is not working well.

If sc_i is close to 1, a_i is much smaller than b_i, and the clustering works well.

The silhouette coefficient lies in [-1, 1]; the closer it is to 1, the better the cohesion and separation.

② Performance Evaluation API

sklearn.metrics.silhouette_score(X, labels)

Clustering algorithms tend to converge to a local optimum, which can be mitigated by clustering several times with different initializations.
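A sketch of evaluating a clustering with silhouette_score, on illustrative toy blobs:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

# three well-separated blobs of toy data
rng = np.random.default_rng(0)
data = np.vstack([rng.normal(c, 0.5, size=(50, 2))
                  for c in ([0, 0], [5, 5], [10, 0])])

km = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
score = silhouette_score(data, km.labels_)
print(score)  # in [-1, 1]; closer to 1 means better cohesion and separation
```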

This concludes the detailed content on Python machine learning basics: logistic regression and unsupervised learning. For more on these topics, please see my other related articles!