First look
k-means refers to the k-means clustering algorithm, which aims to partition the samples into k clusters. In sklearn it is implemented by the KMeans class, whose most important parameter is n_clusters, which defaults to 8.
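As a quick check of that default, a KMeans instance created without arguments reports n_clusters equal to 8:

```python
from sklearn.cluster import KMeans

km = KMeans()
print(km.n_clusters)  # 8
```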
The simplest clustering is done below:

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

X, y = make_blobs(1500)
fig = plt.figure()
for i in range(2):
    ax = fig.add_subplot(1, 2, i + 1)
    y = KMeans(i + 2).fit_predict(X)
    ax.scatter(X[:, 0], X[:, 1], c=y)
plt.show()
```
Here, y is the clustering result; each of its values is the cluster label of the corresponding sample in X.
The effect is shown in the figure. For this data set it is obviously best to split the samples into two clusters, but if n_clusters of KMeans is set to 3, the data will be clustered into 3 groups instead.
The KMeans above is a class; sklearn also provides the same functionality as a callable function, which is used as follows:
```python
from sklearn.cluster import k_means

cen, y, inertia = k_means(X, 3)
```
Here, cen holds the centroid of each cluster after clustering; y is the cluster label of each sample; inertia is the sum of squared distances from the samples to their nearest centroid.
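To make the meaning of inertia concrete, the sketch below (variable names and the fixed random_state are my own choices) checks that it equals the sum of squared distances from each sample to its assigned centroid:

```python
import numpy as np
from sklearn.cluster import k_means
from sklearn.datasets import make_blobs

X, _ = make_blobs(1500, random_state=0)
cen, y, inertia = k_means(X, 3, random_state=0)

# sum of squared distances from each sample to its assigned centroid
manual = ((X - cen[y]) ** 2).sum()
print(np.isclose(manual, inertia))  # True
```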
Initial value selection
The most important concept in KMeans is the cluster, the group into which the data are partitioned; each cluster has one especially important point, its centroid. Setting the number of clusters therefore also fixes the number of centroids. The basic flow of the KMeans algorithm is:
- Select k points as the initial centroids of the k clusters
- Compute each sample's distance to the k centroids and assign the sample to the closest cluster
- Compute the mean of each cluster and use it to update that cluster's centroid
Repeat steps 2-3 until the centroids stabilize or the maximum number of iterations is reached.
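The steps above can be sketched in a few lines of NumPy (a minimal illustration, not sklearn's actual implementation; the function name, tolerance check, and fixed seed are my own choices):

```python
import numpy as np

def simple_kmeans(X, k, n_iter=100, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick k samples as the initial centroids
    cen = X[rng.choice(len(X), k, replace=False)]
    for _ in range(n_iter):
        # step 2: assign each sample to its nearest centroid
        d = ((X[:, None, :] - cen[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        # step 3: move each centroid to the mean of its cluster
        new_cen = np.array([X[labels == j].mean(0) for j in range(k)])
        if np.allclose(new_cen, cen):  # centroids stabilized
            break
        cen = new_cen
    return cen, labels
```

Note that this sketch does not handle the edge case of a cluster losing all its points; sklearn's implementation does.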
As this process shows, at least two details of the KMeans algorithm need consideration: the initialization scheme and the centroid-update scheme.
In the KMeans class and the k_means function, two centroid-initialization schemes are provided, controlled by the init parameter:
- 'random': generate the k centroids at random
- 'k-means++': the default; initialize the centroids with the k-means++ method
The process by which k-means++ initializes the centroids is as follows:
- Randomly select one point as the initial centroid x_0
- Compute each remaining point's distance to its nearest centroid
- Given that n centroids already exist, choose a point far from the current centroids (points farther away are more likely to be chosen) as the next centroid x_n

Repeat steps 2 and 3 until the number of centroids reaches k.
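These steps can be sketched as below. Note that real k-means++ does not always take the single farthest point; it samples the next centroid randomly with probability proportional to the squared distance to the nearest existing centroid, which is the rule this sketch (with its own function name and seed) implements:

```python
import numpy as np

def kmeanspp_init(X, k, seed=0):
    rng = np.random.default_rng(seed)
    # step 1: pick the first centroid at random
    cens = [X[rng.integers(len(X))]]
    while len(cens) < k:
        # step 2: squared distance from each point to its nearest centroid
        d2 = np.min(((X[:, None] - np.array(cens)[None]) ** 2).sum(-1), axis=1)
        # step 3: sample the next centroid with probability proportional to d2
        cens.append(X[rng.choice(len(X), p=d2 / d2.sum())])
    return np.array(cens)
```

Because an already-chosen point has squared distance 0, it can never be drawn again, so the k centroids are distinct.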
If you wish to run the k-means++ initialization on its own, call the kmeans_plusplus function.
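A short usage sketch of kmeans_plusplus, which returns the chosen centers and their indices in X (the data set and random_state here are my own illustration):

```python
from sklearn.cluster import kmeans_plusplus
from sklearn.datasets import make_blobs

X, _ = make_blobs(1500, centers=3, random_state=0)
centers, indices = kmeans_plusplus(X, n_clusters=3, random_state=0)
print(centers.shape)  # (3, 2)
```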
Mini-batches
sklearn offers MiniBatchKMeans, a variant of KMeans that draws a random subsample of the data in each training iteration; this mini-batch training greatly reduces computation time. When the sample size is very large, the advantage of mini-batch KMeans is very obvious:
```python
from sklearn.cluster import MiniBatchKMeans
import time

ys, xs = np.indices([4, 4]) * 6
cens = list(zip(xs.reshape(-1), ys.reshape(-1)))
X, y = make_blobs(100000, centers=cens)
km = KMeans(16)
mbk = MiniBatchKMeans(16)

def test(func, value):
    t = time.time()
    func(value)
    print("Time consumed:", time.time() - t)

test(km.fit_predict, X)   # Time consumed: 3.2028110027313232
test(mbk.fit_predict, X)  # Time consumed: 0.2590029239654541
```
The speedup is clearly visible. Here fit_predict works like fit followed by predict: besides returning the labels, km.fit_predict(X) also stores the clustering result in km's labels_ attribute:
```python
fig = plt.figure()
ax = fig.add_subplot(1, 2, 1)
ax.scatter(X[:, 0], X[:, 1], c=km.labels_, marker='.', alpha=0.5)
ax = fig.add_subplot(1, 2, 2)
ax.scatter(X[:, 0], X[:, 1], c=mbk.labels_, marker='.', alpha=0.5)
plt.show()
```
The results are shown in the figure: judging from the output, there is little difference between standard KMeans and mini-batch KMeans.
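One way to quantify "little difference" is to compare the two label sets with adjusted_rand_score, which is 1.0 for identical partitions and near 0 for unrelated ones (a self-contained sketch; the data, cluster count, and random_state are re-created here for illustration):

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import adjusted_rand_score

X, _ = make_blobs(10000, centers=8, random_state=0)
a = KMeans(8, random_state=0, n_init=10).fit_predict(X)
b = MiniBatchKMeans(8, random_state=0, n_init=10).fit_predict(X)
# scores close to 1 mean the two partitions largely agree
print(adjusted_rand_score(a, b))
```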
This concludes this article on K-Means clustering analysis with Python's sklearn; I hope it is helpful, and for more on K-Means in Python, see my other articles.