SoFunction
Updated on 2024-10-30

An overview of the use of K-Means clustering in Python sklearn

First look

k-means is the k-means clustering algorithm, which aims to partition the samples into k clusters; this k is the most important parameter of KMeans, n_clusters, which defaults to 8.

The simplest clustering can be done as follows

import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(1500)
fig = plt.figure()
for i in range(2):
    ax = fig.add_subplot(1, 2, i+1)
    y = KMeans(i+2).fit_predict(X)
    ax.scatter(X[:, 0], X[:, 1], c=y)
plt.show()

Here, y is the clustering result; its value at each position is the cluster label of the corresponding sample in X.

The effect is shown in the figure: for this set of data it is clearly best to split into two clusters, but if n_clusters of KMeans is set to 3, the data will be clustered into 3 classes.

The KMeans above is a class; sklearn also provides the same functionality as a function, which is used as follows

from sklearn.cluster import k_means
cen, y, inertia = k_means(X, 3)

Here, cen holds the centroid of each cluster after clustering; y is the cluster labels; inertia is the sum of squared distances from each sample to its nearest centroid.
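The relationship between inertia and the returned centroids can be checked directly (a minimal sketch; the make_blobs data and random_state here are illustrative choices, not from the text above):

```python
import numpy as np
from sklearn.datasets import make_blobs
from sklearn.cluster import k_means

X, _ = make_blobs(300, random_state=0)
cen, y, inertia = k_means(X, 3, random_state=0)
# inertia should equal the sum of squared distances from each
# sample to its assigned centroid
manual = ((X - cen[y]) ** 2).sum()
print(np.isclose(inertia, manual))  # True
```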

Initial value selection

In KMeans the most important concept is the cluster, i.e. the groups into which the data are partitioned; and each cluster has one especially important point, its centroid. Once the number of clusters is set, the number of centroids is determined as well. The basic flow of the KMeans algorithm is

  • Select k points as the initial centroids of the k clusters
  • Compute each sample's distance to the k centroids and assign it to the closest cluster
  • Compute the mean of each cluster and use that mean to update the cluster's centroid

Repeat steps 2-3 above until the centroids stabilize or the maximum number of iterations is reached.
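The loop above can be sketched in plain NumPy (a minimal illustration of the described steps, not sklearn's implementation; empty clusters are not handled):

```python
import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k distinct samples as the initial centroids
    centroids = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # step 2: assign each sample to its nearest centroid
        dist = np.linalg.norm(X[:, None] - centroids[None], axis=2)
        labels = dist.argmin(axis=1)
        # step 3: move each centroid to the mean of its cluster
        new = np.array([X[labels == i].mean(axis=0) for i in range(k)])
        if np.allclose(new, centroids):  # centroids stabilized
            break
        centroids = new
    return centroids, labels
```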

As this process shows, at least two details of the KMeans algorithm need attention: one is the centroid initialization scheme, the other is the centroid update scheme.

In the KMeans class and the k_means function, two centroid initialization schemes are provided, controlled by the init parameter

  • 'random': choose k samples at random as the initial centroids
  • 'k-means++': the default, which initializes the centroids with the k-means++ method.

The k-means++ centroid initialization process is as follows

  • Randomly select 1 point as the initial centroid x_0
  • Compute each remaining point's distance to its nearest centroid
  • Given n existing centroids, pick a point far from the current centroids (points farther away are more likely to be chosen) as the next centroid x_n

Repeat steps 2 and 3 until the number of centroids reaches k.
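These steps can be sketched as follows (an illustrative implementation in which the next centroid is drawn with probability proportional to the squared distance, as in the standard k-means++ scheme; not sklearn's code):

```python
import numpy as np

def kmeans_pp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: one random point as the first centroid
    centroids = [X[rng.integers(len(X))]]
    while len(centroids) < k:
        # step 2: squared distance to the nearest existing centroid
        d2 = ((X[:, None] - np.array(centroids)[None]) ** 2).sum(-1).min(axis=1)
        # step 3: far points are more likely to become the next centroid
        centroids.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(centroids)
```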

If you want to run k-means++ initialization on its own, sklearn provides the kmeans_plusplus function.
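A minimal usage sketch (the make_blobs data here is an illustrative choice):

```python
from sklearn.cluster import kmeans_plusplus
from sklearn.datasets import make_blobs

X, _ = make_blobs(500, centers=4, random_state=0)
# returns the chosen centers and their row indices in X
centers, indices = kmeans_plusplus(X, n_clusters=4, random_state=0)
print(centers.shape)  # (4, 2)
```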

Mini-batch

sklearn offers a variant of KMeans, MiniBatchKMeans, which draws a random subset of the samples in each training iteration; this mini-batch training greatly reduces computation time.

When the sample size is very large, the advantage of mini-batch KMeans becomes very obvious

from sklearn.cluster import MiniBatchKMeans
import time

ys, xs = np.indices([4, 4]) * 6   # a 4x4 grid of cluster centers
cens = list(zip(ys.reshape(-1), xs.reshape(-1)))
X, y = make_blobs(100000, centers=cens)
km = KMeans(16)
mbk = MiniBatchKMeans(16)

def test(func, value):
    t = time.time()
    func(value)
    print("Time consumed:", time.time() - t)

test(km.fit_predict, X)
# Time consumed: 3.2028110027313232
test(mbk.fit_predict, X)
# Time consumed: 0.2590029239654541

The speedup is obvious. Here fit is similar to fit_predict, except that fit_predict also returns the labels; after km.fit_predict(X) runs, the labels_ attribute of km holds the classification result.
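This behavior can be checked directly (a minimal sketch; the data and random_state are illustrative):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(300, centers=3, random_state=0)
km = KMeans(3, random_state=0)
labels = km.fit_predict(X)  # fits the model and returns the labels
# the same labels are also stored on the estimator as labels_
print(np.array_equal(labels, km.labels_))  # True
```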

fig = plt.figure()
ax = fig.add_subplot(1, 2, 1)
ax.scatter(X[:, 0], X[:, 1], c=km.labels_,
    marker='.', alpha=0.5)
ax = fig.add_subplot(1, 2, 2)
ax.scatter(X[:, 0], X[:, 1], c=mbk.labels_,
    marker='.', alpha=0.5)
plt.show()

The results are shown in the figure: in terms of clustering quality, there is little difference between plain KMeans and mini-batch KMeans.
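One way to quantify "little difference" is to compare the inertia_ (sum of squared errors) of the two fitted models; this self-contained sketch uses smaller, illustrative data:

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs

X, _ = make_blobs(10000, centers=8, random_state=0)
km = KMeans(8, random_state=0).fit(X)
mbk = MiniBatchKMeans(8, random_state=0).fit(X)
# the mini-batch result is typically only slightly worse
print(km.inertia_, mbk.inertia_)
```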

This concludes the overview of K-Means clustering with Python sklearn. For more on K-Means in Python, search my earlier articles or keep browsing the related articles below. I hope you will continue to support me!